Introduction
When I was debugging a core dump issue discovered in ADS (Autonomous Driving System) testing, I found a very interesting error message in the log file:
terminate called after throwing an instance of 'boost::interprocess::interprocess_exception'
what(): Resource deadlock avoided
At first I suspected a bug in how my code used the Boost.Interprocess file lock. After some investigation, I found that the message actually comes from the operating system: it is the system's description of the EDEADLK error, which the lock call underneath Boost reported.
The scenario is that we have multiple processes, and each process may have multiple threads concurrently accessing our map data. The data is organized into multiple files, and each file is protected by a file lock. The file lock is implemented with the Boost.Interprocess file lock, and each thread can grab a read lock (RLock) or a write lock (WLock) to access the file. An RLock can be shared among threads, while a WLock is exclusive to the thread that grabs it.
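A quick way to check where the wording comes from is to ask the C library for the text it attaches to EDEADLK. This is only a small sketch; the helper name deadlock_errno_text is made up for illustration.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: returns the text the C library associates with
// EDEADLK, the errno a blocking file-lock request reports when the kernel
// refuses it in order to avoid a deadlock.
std::string deadlock_errno_text() {
    return std::strerror(EDEADLK);
}
```

With glibc this returns "Resource deadlock avoided"; other C libraries may word the same error differently.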
Root cause
The root cause of this issue is that the operating system detects deadlocks at process granularity, not at thread granularity. It treats a file lock as owned by the whole process, so it cannot tell which thread holds a lock and which thread is waiting for one.
What this means is that if we have 2 processes, each with 2 threads (call them P1T1, P1T2, P2T1 and P2T2), then:
- at timestamp t1, P1T1 grabs a write lock on file D1 and P2T1 grabs a write lock on file D2
- at timestamp t2, P1T2 attempts to grab a read lock on file D2 and P2T2 attempts to grab a read lock on file D1
There is no real deadlock at timestamp t2, because P1T1 and P2T1 are not blocked and will eventually release their write locks. The OS nevertheless thinks there is a deadlock, because it sees P1 waiting for P2 and P2 waiting for P1.
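The difference between the two granularities can be illustrated with a toy wait-for check. This is a simplified model written for this post, not the kernel's code: LockState, would_deadlock and scenario_reports_deadlock are hypothetical names, and each owner is assumed to wait for at most one file.

```cpp
#include <map>
#include <set>
#include <string>

// Toy model of what a deadlock detector knows.
struct LockState {
    std::map<std::string, std::string> holder;   // file  -> owner holding a write lock
    std::map<std::string, std::string> waiting;  // owner -> file it is blocked on
};

// Follow the chain "holder of the file -> file that holder waits for -> ...".
// A deadlock is reported if the chain leads back to the requester.
bool would_deadlock(const LockState& s, const std::string& requester,
                    const std::string& file) {
    std::set<std::string> seen;
    auto h = s.holder.find(file);
    while (h != s.holder.end()) {
        const std::string& owner = h->second;
        if (owner == requester) return true;
        if (!seen.insert(owner).second) return false;  // loop without requester
        auto w = s.waiting.find(owner);
        if (w == s.waiting.end()) return false;  // holder is running, no deadlock
        h = s.holder.find(w->second);
    }
    return false;
}

// Replays the t1/t2 scenario. With per_process = true every thread name is
// collapsed to its process ("P1T2" -> "P1"), as a per-process detector would.
bool scenario_reports_deadlock(bool per_process) {
    auto owner = [per_process](const std::string& thread) -> std::string {
        return per_process ? thread.substr(0, 2) : thread;
    };
    LockState s;
    s.holder["D1"] = owner("P1T1");  // t1
    s.holder["D2"] = owner("P2T1");  // t1
    bool first = would_deadlock(s, owner("P1T2"), "D2");  // t2, first request
    s.waiting[owner("P1T2")] = "D2";                      // it blocks
    bool second = would_deadlock(s, owner("P2T2"), "D1"); // t2, second request
    return first || second;
}
```

In this model, scenario_reports_deadlock(true) reports a deadlock on the second request, while scenario_reports_deadlock(false) does not, because the threads that hold the write locks are not waiting for anything.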
Verification
Below I provide a simple C++ program to reproduce this issue. The program has 2 processes, and each process has 2 threads. Each thread grabs a lock on a file and sleeps for a while. The first process grabs a write lock on file D1 and then a read lock on file D2. The second process grabs a write lock on file D2 and then a read lock on file D1.
Note that some internal implementation details are omitted for simplicity.
If we run the program, we can reproduce the Resource deadlock avoided error message.
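#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>  // fork, sleep

#include <string>
#include <thread>
#include <vector>

#include "flock.h"  // internal filelock implementation
#include "util.h"   // internal; assumed to pull in fmt and spdlog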
// Take a read lock on <block_name>.lock, hold it for 3 seconds, then release it.
void routine_read_block(const std::string& block_name) {
    std::string lock_name = fmt::format("{}.lock", block_name);
    FileLock lock(lock_name);
    lock.Readlock();
    SPDLOG_INFO("read lock on {} acquired", lock_name);
    sleep(3);
    lock.Unlock();
    SPDLOG_INFO("read lock on {} released", lock_name);
}

// Take a write lock on <block_name>.lock, hold it for 5 seconds, then release it.
void routine_write_block(const std::string& block_name) {
    std::string lock_name = fmt::format("{}.lock", block_name);
    FileLock lock(lock_name);
    lock.Writelock();
    SPDLOG_INFO("write lock on {} acquired", lock_name);
    sleep(5);
    lock.Unlock();
    SPDLOG_INFO("write lock on {} released", lock_name);
}
int main() {
    SPDLOG_INFO("=========2 process, each 2 threads=========");
    {
        auto pid1 = fork();
        if (pid1 == 0) {
            // child process: write lock on D1, then read lock on D2
            std::vector<std::thread> threads;
            threads.emplace_back(routine_write_block, "D1");
            sleep(1);
            threads.emplace_back(routine_read_block, "D2");
            for (auto& t : threads) {
                t.join();
            }
        } else {
            // parent process: write lock on D2, then read lock on D1
            std::vector<std::thread> threads;
            threads.emplace_back(routine_write_block, "D2");
            sleep(1);
            threads.emplace_back(routine_read_block, "D1");
            for (auto& t : threads) {
                t.join();
            }
            wait(NULL);
        }
    }
    return 0;
}
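Since the program above depends on internal headers, here is a rough self-contained variant that talks to fcntl directly. It is a sketch under assumptions, not the internal implementation: the helper names (set_lock, run_side, reproduce), the file names and the timings are made up, and it assumes Linux with classic process-owned fcntl record locks. The two read requests are staggered so that the parent's request is the one that closes the cycle.

```cpp
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

// Blocking whole-file lock request; returns 0 on success or the errno value.
int set_lock(int fd, short type) {
    struct flock fl {};
    fl.l_type = type;        // F_RDLCK, F_WRLCK or F_UNLCK
    fl.l_whence = SEEK_SET;  // l_start = 0 and l_len = 0 cover the whole file
    return fcntl(fd, F_SETLKW, &fl) == 0 ? 0 : errno;
}

// One process of the scenario: the calling thread write-locks `own` for 2 s,
// while a second thread asks for a read lock on `other` after `delay_ms`.
// Returns the outcome of the second thread's request.
int run_side(const std::string& own, const std::string& other, int delay_ms) {
    int own_fd = open(own.c_str(), O_RDWR | O_CREAT, 0644);
    int other_fd = open(other.c_str(), O_RDWR | O_CREAT, 0644);
    if (own_fd < 0 || other_fd < 0) return errno;
    set_lock(own_fd, F_WRLCK);
    int result = 0;
    std::thread reader([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
        result = set_lock(other_fd, F_RDLCK);
    });
    std::this_thread::sleep_for(std::chrono::seconds(2));
    set_lock(own_fd, F_UNLCK);  // the writer was never stuck; it releases on schedule
    reader.join();
    close(own_fd);
    close(other_fd);
    return result;
}

// Forks once. The child holds D1 and asks for D2 first; the parent holds D2
// and asks for D1 second. Returns the outcome of the parent's read request.
int reproduce(const std::string& dir) {
    const std::string tag = std::to_string(getpid());
    const std::string d1 = dir + "/D1." + tag + ".lock";
    const std::string d2 = dir + "/D2." + tag + ".lock";
    pid_t pid = fork();
    if (pid < 0) return errno;
    if (pid == 0) {
        run_side(d1, d2, 500);
        _exit(0);
    }
    int result = run_side(d2, d1, 1000);
    waitpid(pid, nullptr, 0);
    unlink(d1.c_str());
    unlink(d2.c_str());
    return result;
}
```

Calling reproduce("/tmp") is expected to return EDEADLK on Linux, even though both write locks are released after 2 seconds and nothing is truly stuck. Build with thread support enabled (for example -pthread).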
Solution
The solution is to catch the Resource deadlock avoided error and retry the lock acquisition a limited number of times, sleeping for a random interval before each retry so that the competing processes are unlikely to collide again at the same moment.
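A minimal sketch of that retry loop might look like the following. It is an illustration under assumptions: lock_with_retry is a hypothetical helper, the failure is modeled as a std::system_error carrying EDEADLK rather than the Boost exception, and the attempt count and sleep bound are arbitrary. With the real lock wrapper, the catch clause would need adapting to however that wrapper reports EDEADLK.

```cpp
#include <chrono>
#include <random>
#include <system_error>
#include <thread>

// Hypothetical helper: calls `acquire` until it succeeds or `max_attempts`
// calls have failed with EDEADLK. Any other error is rethrown unchanged.
// `max_sleep` must be at least 1 ms.
template <typename Acquire>
bool lock_with_retry(Acquire acquire, int max_attempts,
                     std::chrono::milliseconds max_sleep = std::chrono::milliseconds(50)) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<long long> jitter(1, max_sleep.count());
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            acquire();
            return true;
        } catch (const std::system_error& e) {
            // Only the spurious "deadlock" is worth retrying.
            if (e.code() != std::errc::resource_deadlock_would_occur) throw;
            if (attempt == max_attempts) return false;
            // Random pause so the two processes do not retry in lockstep.
            std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
        }
    }
    return false;
}
```

The function returns false when every attempt failed, which leaves it to the caller to decide whether to give up or escalate. Capping the number of attempts matters because a genuine deadlock would otherwise turn into an endless retry loop.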