The Cache Crash

By Matthew Horsfall, Principal Software Developer

Software Freedom Day is this week, and we thought we’d mark the occasion with a post about some of our work on open source software that we maintain. Fastmail believes in being a good internet citizen, and that belief means we participate in the development of free software, both by sharing software that we write and by contributing to the maintenance of free software that we use.

This is a highly technical post about how we solved a crash in our automated test systems. You do not need to make any changes to your Fastmail account or your email software, and you can ignore this post if it’s not interesting to you!

What happens when a mysterious bug causes all your tests to fail once or twice a week? We try to solve these mysteries, even if it takes a year.

At Fastmail we have a strong testing culture around our code. When we write a new feature or fix a bug, we include tests to show that our feature works or the bug has actually been fixed. We open a Merge Request, and GitLab CI kicks off a job to run the project’s full test suite. We also have a scheduled job that runs once a day, testing our mainline branch. This checks that the tests haven’t started failing because of changes to the outside world.

The job stands up a virtual machine, installs all of the necessary software, updates configurations, starts the various services that make up Fastmail, and then runs all of the tests. Finally, it shuts down the box and generates some downloadable artifacts with the test results. There are quite a few tests. All told, this process takes about 13 minutes. As a developer, you’re hoping to see this:

File Count: 441
Assertion Count: 22040
    -->  Result: PASSED  <--

Starting some time in early 2023, we’d occasionally have jobs fail in strange ways. We didn’t see failures around the features we were changing. Instead, we’d see a bunch of seemingly random tests fail with the errors “Signal: 6” and “Signal: 11”:

( STDERR )  job 364    free(): invalid size
< REASON >  job 364    Test script returned error (Signal: 6)
< REASON >  job 364    No plan was declared, and no assertions were made.
< REASON >  job 369    Test script returned error (Signal: 11)
< REASON >  job 369    No plan was declared, and no assertions were made.
< REASON >  job 373    Test script returned error (Signal: 11)
< REASON >  job 373    No plan was declared, and no assertions were made.

That means one process got a SIGABRT and then after that a bunch of processes got SIGSEGV. Something, somewhere, was going wrong with the software’s use of memory. “Use of memory”, though, covers a lot of what any given program is doing.
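
Those signal numbers map directly to names you might recognise. A two-line check (Linux values shown in the comment):

#include <signal.h>
#include <stdio.h>

int main(void) {
  /* On Linux, SIGABRT is 6 and SIGSEGV is 11. */
  printf("SIGABRT = %d, SIGSEGV = %d\n", SIGABRT, SIGSEGV);
  return 0;
}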

At the time, our test runs didn’t have much logging, so we couldn’t really tell what was going on, and we couldn’t easily reproduce the problem. On any given run, either it happened or it didn’t. For a while, I just assumed our VMs were occasionally dodgy somehow. The problem was rare, so we tried to ignore it and just click “Re-run job” as needed.

Soon though, it began to happen more frequently.

I added logging to our builds, and I made sure that the job’s build artifacts would always contain all the logs. With this evidence in hand, we began to see a pattern:

t1[43556]: segfault at 8 ip 00007fca57fde1ee sp 00007ffdeab5ef00
  error 4 in FastMmap.so[7fca57fda000+6000] likely on CPU 2 (core 2, socket 0)
t3[43580]: segfault at 8 ip 00007fca57fde1ee sp 00007ffdeab5ef00
t9[43685]: segfault at 8 ip 00007fca57fde1ee sp 00007ffdeab5ef00
  error 4 in FastMmap.so[7fca57fda000+6000] likely on CPU 3 (core 3, socket 0)

The crash was always in FastMmap.so! This comes from Cache::FastMmap, a Perl module of ours that provides an mmap-backed cache for use across processes.
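
If you haven’t run into this technique before, here’s a minimal sketch of the general idea (this is not Cache::FastMmap’s actual code, and the file name is invented): every process maps the same file with MAP_SHARED, so a write made by one process is immediately visible to the others.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  /* Every process that opens and maps this same file shares these bytes. */
  int fd = open("shared.cache", O_RDWR | O_CREAT, 0600);
  ftruncate(fd, 4096);   /* make sure the backing file is big enough */

  char *cache = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (cache == MAP_FAILED) return 1;

  /* Any other process mapping the same file sees this write. */
  memcpy(cache, "hello from one process", 23);

  munmap(cache, 4096);
  close(fd);
  return 0;
}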

Armed with this information, I modified our builds to collect the on-disk cache files, too. With those, I’d be able to examine what was in the cache when a crash happened. Unfortunately, the cache files were corrupt, leaving me with more questions than answers.

Over the course of months, I started taking a day here and there to dig through this problem – poring over the Cache::FastMmap code, our available logs, and our reproductions using the captured files. I tested and tossed out a lot of theories.

Eventually, I added the ability to capture the core files generated during the test runs, because I realized that when this problem triggered, there was always a SIGABRT first with the error message seen below, which previously I hadn’t thought of as significant:

( STDERR )  job 364    free(): invalid size
< REASON >  job 364    Test script returned error (Signal: 6)

It seemed pretty clear that the first crash was corrupting the file, and all of the processes that used the cache file afterwards were crashing because of that.

So after adding the core file collector, I sat there clicking “Run job” on my merge request over and over and over and over and over, trying to trigger a crash.

[Screenshot: a post from the author about being unable to reproduce the bug]

Eventually, it worked… but unfortunately I had no debugging symbols in the core files, so it didn’t help. I modified the build again to include them, and then clicked “Run job” a bunch more.

Finally, I got what I needed, but this was even more confounding. The stack trace showed that the crash happened here:

#3  0x00007f8c429d6195 in mmc_do_expunge (cache=cache@entry=0x56113b842070,
    num_expunge=<optimized out>,
    new_num_slots=<optimized out>,
    to_expunge=0x56113b506210) at mmap_cache.c:800
800	  free(to_expunge);

to_expunge was a valid pointer, and nothing else had freed it yet from what I could tell by looking at the source.

This told me that most likely we were writing over memory somewhere where we shouldn’t be, and we were stomping on malloc’s internal structures. Unfortunately, these kind of bugs are really hard to track down unless you can catch them in the act using tools like Valgrind or AddressSanitizer.
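
This is exactly the kind of bug AddressSanitizer exists for: build with -fsanitize=address and it flags the bad write itself, instead of letting glibc abort much later inside free(). Here’s a tiny, made-up example of the sort of out-of-bounds write it catches:

/* cc -g -fsanitize=address smash.c && ./a.out */
#include <stdlib.h>

int main(void) {
  long *buf = calloc(4, sizeof(long));
  buf[4] = 42;   /* one element past the end of the allocation */
  free(buf);     /* without ASan, the damage may only surface here, or much later */
  return 0;
}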

I tried various things from there, including modifying the build to output every action taken against the cache file. The logs included a high-resolution timestamp, the action that was taken (fetch, store, delete), and whether or not it succeeded.
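
The real logging lives inside our build of Cache::FastMmap, but each trace line was shaped roughly like this (a simplified sketch, with a made-up cache key):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static void trace(const char *action, const char *key, int ok) {
  /* High-resolution timestamp, pid, action, key, and result. */
  struct timeval tv;
  gettimeofday(&tv, NULL);
  fprintf(stderr, "%ld.%06ld [%d] %s %s -> %s\n",
          (long)tv.tv_sec, (long)tv.tv_usec, (int)getpid(),
          action, key, ok ? "ok" : "fail");
}

int main(void) {
  trace("fetch", "user:123:prefs", 1);
  trace("store", "user:123:prefs", 1);
  return 0;
}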

I replayed these recordings (single-threaded, so not a perfect test: it didn’t imitate inter-process locking or coordination) with Valgrind and AddressSanitizer, but I got nowhere.

I stared at the code, thought for a while longer, and then did some web searching for the “free(): invalid size” error to see if that could maybe help me pinpoint where a bad write was happening.

That led me to a post on StackOverflow that pointed toward negative indexing. After reading that, I tried a few experimental programs until I reproduced the abort:

#include <stdlib.h>

int main(void) {
  int **x = (int**)calloc(5, sizeof(int*));   /* a couple of small heap allocations */
  int **y = (int**)calloc(5, sizeof(int*));
  int **z = x;
  int *num = malloc(sizeof(int));
  *--z = (int *)&num;   /* write one slot below x, clobbering malloc's bookkeeping */
  free(y);
  free(x);              /* glibc spots the damaged chunk header */
  return 0;
}
$ gcc negative-array-index.c
$ ./a.out
free(): invalid size
Aborted (core dumped)

Could it be that we were underflowing an array? That is, were we accidentally writing to memory below the lower bound of an array we’d allocated? With this theory in mind, I looked through the Cache::FastMmap source to find where we might be underflowing an array, and I spotted this loop in mmc_calc_expunge():

MU32 ** copy_base_det = (MU32 **)calloc(used_slots, sizeof(MU32 *));
MU32 ** copy_base_det_in = copy_base_det + used_slots;

//[... more code ...]

/* Loop for each existing slot, and store in a list */
for (; slot_ptr != slot_end; slot_ptr++) {
  MU32 data_offset = *slot_ptr;
  MU32 * base_det = S_Ptr(cache->p_base, data_offset);
  MU32 expire_on, kvlen;

  /* Ignore if free slot */
  if (data_offset <= 1) {
    continue;
  }

  /* Definitely out if mode == 1 which means expunge all */
  if (mode == 1) {
    *copy_base_det_out++ = base_det;
    continue;
  }

  /* Definitely out if expired, and not dirty */
  expire_on = S_ExpireOn(base_det);
  if (expire_on && now >= expire_on) {
    *copy_base_det_out++ = base_det;
    continue;
  }

  /* Track used space */
  kvlen = S_SlotLen(base_det);
  ROUNDLEN(kvlen);
  ASSERT(kvlen <= page_data_size);
  used_data += kvlen;
  ASSERT(used_data <= page_data_size);

  /* Potentially in */
  *--copy_base_det_in = base_det;
}

This code allocates a new region of memory based on how many used slots we have (used_slots) and takes a pointer to the end of that region of memory:

MU32 ** copy_base_det = (MU32 **)calloc(used_slots, sizeof(MU32 *));
MU32 ** copy_base_det_in = copy_base_det + used_slots;

Then, we loop over every slot in the cache, looking for ones that have data…

for (; slot_ptr != slot_end; slot_ptr++) {
  MU32 data_offset = *slot_ptr;
  MU32 * base_det = S_Ptr(cache->p_base, data_offset);

…skipping ones that don’t…

  /* Ignore if free slot */
  if (data_offset <= 1) {
    continue;
  }

…and finally copying over ones that do have data to the new region of memory, starting at the end of the memory region and working towards the front:

  *--copy_base_det_in = base_det;

(Later, copy_base_det ends up in to_expunge seen above.)

In order for this loop to underflow, the cache would have to have more slots with data in them than used_slots accounts for.

The only way I could see this happening was if two processes wrote to the cache at the same time, corrupting its accounting. That, though, would mean locking had failed somehow…

To chase that down, I modified the build to log when a lock was acquired, when a write happened, and when the lock was released. I clicked “Run job” wildly, like I was playing the world’s worst Cookie Clicker game… and eventually I got another hit.

This time, there was a smoking gun:

1739384152.734390 [32464] locked page 3: p_offset: 786432 pns: 179 pfs: 115 size: 262144
1739384152.737327 [32928] locked page 3: p_offset: 786432 pns: 179 pfs: 115 size: 262144
1739384152.738589 [32928] unlocked page 3 (with changes): pns: 179 pfs: 114
1739384152.738776 [32464] unlocked page 3 (with changes): pns: 179 pfs: 115

Here, pns is the number of slots, and pfs is the number of free slots.

Above, we see that process 32464 locked the cache, and then before 32464 could write to the cache, process 32928 also locked the cache! That’s supposed to be impossible! Then, the second process wrote an entry to the cache, using up another slot, so it decremented the free slot count. And then, the first process wrote out its changes. Those didn’t include the use of any new slots, and so it set the free slot count back up to 115!

This is what caused mmc_calc_expunge to underflow – it allocated memory for 64 slots (179 total slots minus 115 free slots) but ended up seeing 65 slots with data that it needed to copy over!
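
To make that concrete, here’s a stripped-down, hypothetical re-creation of the arithmetic, borrowing only the numbers from the log above: allocate room for 64 pointers, fill 65 of them working backwards from the end, and the 65th write lands one slot below the allocation, exactly the negative-index pattern from earlier.

#include <stdlib.h>

int main(void) {
  int used_slots = 64;   /* 179 total slots minus 115 "free" slots */
  int actual     = 65;   /* what the corrupted page really held */
  static long dummy;

  /* Same shape as the mmc_calc_expunge loop: allocate used_slots
     pointers, then fill the list backwards from the end. */
  long **copy_base_det    = calloc(used_slots, sizeof(long *));
  long **copy_base_det_in = copy_base_det + used_slots;

  for (int i = 0; i < actual; i++)
    *--copy_base_det_in = &dummy;   /* the 65th write lands at copy_base_det[-1] */

  free(copy_base_det);              /* glibc aborts: free(): invalid size */
  return 0;
}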

Using the available logs on the machine, I identified the two processes, and found that one of them was a test that I had long suspected was the culprit.

That test uses IO::Async::Process to fork a child, then run some Perl code in that child. It turns out that IO::Async::Process closes all file descriptors except STDIN, STDOUT, and STDERR when forking. This is a surprising thing to do when forking – often a parent might intentionally leave file descriptors open for a child to continue using.

A file descriptor is just a number representing an open file or socket (like a network connection).

When Cache::FastMmap first starts up, it opens the cache file for reading/writing:

int fh = open(cache->share_file, O_RDWR);
if (fh == -1) {
  return _mmc_set_error(cache, errno, "Open of share file %s failed", cache->share_file);
}
cache->fh = fh;

It then maps that file into memory:

cache->mm_var = mmap(0, cache->c_size, PROT_READ | PROT_WRITE, MAP_SHARED, cache->fh, 0);

Later, when it needs to write out changes to a specific page in the cache, it locks that region of mapped memory by locking specific ranges of bytes within the file using the file descriptor number:

/* Setup fcntl locking structure */
lock.l_type = F_WRLCK;
lock.l_whence = SEEK_SET;
lock.l_start = p_offset;
lock.l_len = cache->c_page_size;

/* Lock the page (block till done, signal, or timeout) */
lock_res = fcntl(cache->fh, F_SETLKW, &lock);

If another process using Cache::FastMmap with the same file tries to write changes, it will block trying to get the same lock until the current process is finished and unlocks the file.
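
As a rough standalone illustration of that blocking behaviour (not the module’s code; the file name is invented): the parent below takes a write lock on the first page of a file, and because POSIX record locks are owned per process, the forked child’s F_SETLKW waits until the parent releases it.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  int fd = open("demo.cache", O_RDWR | O_CREAT, 0600);

  struct flock lock = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 4096 };
  fcntl(fd, F_SETLKW, &lock);            /* parent locks "page 0" */

  if (fork() == 0) {
    /* Child: record locks are not inherited across fork, so this
       conflicts with the parent's lock and blocks until it is released. */
    struct flock child_lock = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                                .l_start = 0, .l_len = 4096 };
    fcntl(fd, F_SETLKW, &child_lock);
    printf("child finally got the lock\n");
    _exit(0);
  }

  sleep(1);                              /* hold the lock for a moment */
  lock.l_type = F_UNLCK;
  fcntl(fd, F_SETLK, &lock);             /* release; the child wakes up */
  wait(NULL);
  return 0;
}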

However, when IO::Async::Process closes all open file descriptors, cache->fh becomes invalid, since that number no longer belongs to any open file or socket. Then, later, if the child process opens up any new files or sockets, that number could be reused(!), so when Cache::FastMmap attempts to lock the cache in the child, it succeeds, but locks the wrong file!
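
Here’s a tiny standalone demonstration of that descriptor reuse (the file names are made up): the number the cache remembered gets handed out again for a completely different file, and the lock call happily succeeds against the wrong one.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int cache_fd = open("cache.file", O_RDWR | O_CREAT, 0600);
  printf("cache fd: %d\n", cache_fd);      /* e.g. 3 */

  close(cache_fd);                         /* someone closes it behind our back */

  int other_fd = open("unrelated.file", O_RDWR | O_CREAT, 0600);
  printf("other fd: %d\n", other_fd);      /* the lowest free number: 3 again */

  /* The cache still remembers the number 3, so this locks
     unrelated.file, not cache.file. No error, no protection. */
  struct flock lock = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 4096 };
  fcntl(cache_fd, F_SETLKW, &lock);
  return 0;
}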

The memory-mapped region is not tied to the file descriptor once the mapping has been made, so any changes written to that memory still overwrite the cache data, stomping on whatever work any other process is doing at the same time.
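
You can see this for yourself with a few lines of C (again a simplified sketch with an invented file name): close the descriptor after mmap() and writes through the mapping still land in the file.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  int fd = open("cache.file", O_RDWR | O_CREAT, 0600);
  ftruncate(fd, 4096);

  char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (map == MAP_FAILED) return 1;

  close(fd);                         /* closing the descriptor does NOT unmap */

  memcpy(map, "still writing", 14);  /* this still ends up in cache.file */
  msync(map, 4096, MS_SYNC);
  munmap(map, 4096);
  return 0;
}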

We fixed all this by ditching the “fork and run more Perl in the child” approach and replacing it with “fork and exec a brand new process”. With exec, we replace the running program entirely, so we no longer have the mmapped regions available. We’ve been segfault-free ever since! We also added code to a new open source release of Cache::FastMmap to try to prevent anybody else from going through all this.
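
In C terms, the shape of the fix looks something like this (the interpreter path and script name are placeholders, and our actual change was in Perl test code): exec throws away the child’s inherited address space, mappings and all, and starts clean.

#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid == 0) {
    /* Child: replace the entire process image. Inherited mappings
       (including the mmapped cache) do not survive execl(). */
    execl("/usr/bin/perl", "perl", "child-worker.pl", (char *)NULL);
    _exit(127);                      /* only reached if exec fails */
  }
  waitpid(pid, NULL, 0);             /* parent waits for the child */
  return 0;
}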

What’s really frustrating about this is that I almost cracked it months ago. I had suspected that the close-all-file-descriptors-on-fork behaviour of IO::Async::Process was the problem, but when I tested some code manually to trigger the bug, I always got “invalid file descriptor” in the children during locking of the mmapped region… because my tests weren’t opening up any new file descriptors. The lock file descriptor wasn’t valid, and so the mmapped region couldn’t be written over.

Alas.