Debugging the "Resource deadlock avoided" Error

Introduction

When I was debugging a core dump issue discovered in ADS (Autonomous Driving System) testing, I found a very interesting error message in the log file:

terminate called after throwing an instance of 'boost::interprocess::interprocess_exception'
  what():  Resource deadlock avoided

At first I suspected a bug in how my code used the interprocess file lock (boost::interprocess::file_lock) from the Boost library. After some investigation, though, I found that the message actually comes from the operating system: "Resource deadlock avoided" is the text for the EDEADLK error, which the kernel returns when its deadlock detection rejects a blocking lock request.

The scenario is that we have multiple processes, and each process may have multiple threads concurrently accessing our map data. The data is organized into multiple files, and each file is protected by a file lock. The file lock is implemented on top of the Boost interprocess file lock, and each thread can grab an RLock or a WLock before accessing the file. The RLock can be shared among threads, and the WLock is exclusive to the thread that grabs it.

Root cause

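The root cause is that the operating system performs deadlock detection for file locks at the granularity of processes, not threads. The locks are owned by the process, so the kernel cannot tell which thread holds a lock or which thread is waiting for one.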

Concretely, suppose we have 2 processes with 2 threads each (call them P1T1, P1T2, P2T1 and P2T2). Then:

at timestamp t1, P1T1 grabs a write lock on file D1 and P2T1 grabs a write lock on file D2
at timestamp t2, P1T2 attempts to grab a read lock on file D2 and P2T2 attempts to grab a read lock on file D1

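There is no real deadlock at timestamp t2, because P1T1 and P2T1 are not blocked and will eventually release their write locks. The OS, however, only sees that P1 is waiting for P2 while P2 is waiting for P1, so it reports a deadlock and fails the lock request that would close the cycle.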
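To make the mechanism more concrete, here is a minimal sketch of a blocking file lock at the system call level. It assumes a POSIX system and assumes that the lock is a whole-file fcntl record lock, which as far as I know is what the Boost file lock uses underneath on such platforms. The helper name lock_whole_file is made up for this illustration and is not part of Boost or of our code base.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: blocks until a whole-file lock of the given type
// (F_RDLCK, F_WRLCK, or F_UNLCK to release) is applied to fd.
// Returns 0 on success, or the errno value on failure.
int lock_whole_file(int fd, short type) {
  assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
  struct flock fl {};
  fl.l_type = type;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;  // 0 means "up to the end of the file"
  // F_SETLKW waits for conflicting locks held by other processes.
  while (fcntl(fd, F_SETLKW, &fl) == -1) {
    if (errno != EINTR) {
      return errno;  // EDEADLK ("Resource deadlock avoided") ends up here
    }
  }
  return 0;
}
```

Nothing in this call identifies the calling thread. The kernel records the lock against the process, so when P1T2 blocks on D2 the wait is attributed to P1 as a whole, and the cycle check runs over processes only.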

Verification

Below is a simple C++ program that reproduces the issue. It forks so that there are 2 processes, each with 2 threads. Each thread grabs a lock on a file, holds it for a few seconds, and releases it. The first process takes a write lock on file D1 and, one second later, requests a read lock on file D2 from a second thread. The second process does the same with a write lock on file D2 and a read lock on file D1.

Note that some internal implementation details are omitted for simplicity.

If we run the program, we can reproduce the Resource deadlock avoided error message.

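#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>  // fork, sleep

#include <string>
#include <thread>
#include <vector>

#include "flock.h" // internal filelock implementation
#include "util.h"  // internal helpers; assumed to pull in fmt and spdlog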

void routine_read_block(const std::string& block_name) {
  std::string lock_name = fmt::format("{}.lock", block_name);
  FileLock lock(lock_name);
  lock.Readlock();
  SPDLOG_INFO("read lock on {} acquired", lock_name);
  sleep(3);
  lock.Unlock();
  SPDLOG_INFO("read lock on {} released", lock_name);
}

void routine_write_block(const std::string& block_name) {
  std::string lock_name = fmt::format("{}.lock", block_name);
  FileLock lock(lock_name);
  lock.Writelock();
  SPDLOG_INFO("write lock on {} acquired", lock_name);
  sleep(5);
  lock.Unlock();
  SPDLOG_INFO("write lock on {} released", lock_name);
}

int main() {
  SPDLOG_INFO("=========2 process, each 2 threads=========");
  {
    auto pid1 = fork();
    if (pid1 == 0) {
      // child process
      std::vector<std::thread> threads;
      threads.emplace_back(routine_write_block, "D1");
      sleep(1);
      threads.emplace_back(routine_read_block, "D2");
      for (auto& t : threads) {
        t.join();
      }
    } else {
      // parent process
      std::vector<std::thread> threads;
      threads.emplace_back(routine_write_block, "D2");
      sleep(1);
      threads.emplace_back(routine_read_block, "D1");
      for (auto& t : threads) {
        t.join();
      }
      wait(NULL);
    }
  }
  return 0;
}

Solution

The solution is to catch the Resource deadlock avoided error, sleep for a random interval, and retry the lock acquisition, giving up after a limited number of attempts.
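The retry logic could look roughly like the sketch below. To keep it self-contained, it assumes the lock call reports failure by throwing std::system_error. With Boost you would instead catch boost::interprocess::interprocess_exception and, I believe, compare its get_native_error() against EDEADLK. The name lock_with_retry and its parameters are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <random>
#include <system_error>
#include <thread>

// Hypothetical helper: calls acquire() until it succeeds or max_attempts is
// reached. Only the "resource deadlock" error is retried. Any other error,
// and the final failed attempt, is rethrown to the caller.
template <typename Acquire>
void lock_with_retry(Acquire acquire, int max_attempts,
                     std::chrono::milliseconds max_backoff) {
  assert(max_attempts > 0);
  assert(max_backoff.count() >= 1);
  thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<long long> jitter(1, max_backoff.count());

  for (int attempt = 1;; ++attempt) {
    try {
      acquire();
      return;
    } catch (const std::system_error& e) {
      const bool deadlock =
          e.code() == std::errc::resource_deadlock_would_occur;
      if (!deadlock || attempt == max_attempts) {
        throw;
      }
    }
    // Random sleep so that the competing processes do not retry in lockstep.
    std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
  }
}
```

A call site might then read lock_with_retry([&] { lock.Readlock(); }, 5, std::chrono::milliseconds(200)), assuming Readlock is adapted to throw the expected exception type. The random sleep matters because both processes hit the error under the same timing, and a fixed delay could make them collide again on every retry.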

References

https://gist.github.com/harrah/4714661