Tracking down rare cache consistency bugs using distributed consistency tracing at scale

Large-scale distributed systems frequently struggle with cache consistency. It’s often called one of the hardest problems in computer science. In read-heavy systems such as Facebook’s TAO, the sheer scale makes it even harder to detect rare inconsistencies and generate actionable information about them. Replica comparison-based checking can help with detection, but it doesn’t provide insight into the actual cause of an inconsistency. We need a way to trace requests within the system to pinpoint where and how cache inconsistencies are introduced.

We don’t know in advance which operations against our cache will introduce inconsistencies, but we also can’t practically trace every operation in a high-throughput system. The key insight is that there are only a limited number of cases in which the cache contents are actually allowed to change. While read operations make up the vast majority of cache traffic and can still trigger cache fills where a bug could introduce an inconsistency, reads of recently mutated data are far more likely to be of interest. Can we trace just the reads of recently mutated data?

In a read-optimized write-through cache such as TAO, we care when a write occurs and invalidations are sent to other replicas; we care when a read on a recently written key results in a cache fill; we care when an at-or-after versioned read forces the system to refill the cache.
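For concreteness, those are the three kinds of events a tracer would need to hook. The enum below is just an illustrative way to name them in a sketch; the identifiers are assumptions, not TAO’s API.

```python
from enum import Enum, auto

# Illustrative taxonomy of the events worth tracing in a read-optimized
# write-through cache. These names are assumptions for this sketch, not
# identifiers from TAO.

class TracedEvent(Enum):
    WRITE_WITH_INVALIDATIONS = auto()      # a write lands and invalidations fan out to other replicas
    FILL_OF_RECENTLY_WRITTEN_KEY = auto()  # a read of a recently written key results in a cache fill
    AT_OR_AFTER_READ_REFILL = auto()       # an at-or-after versioned read forces a refill
```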

Our system samples write traffic and tags a sampled write as traced as soon as it is sent. Then, any RPCs involving that key are traced for a period of time, including asynchronous invalidation and replication messages, subsequent cache fills that occur on reads, and even at-or-after versioned read requests. The system can trigger traces even if the write has not yet been asynchronously replicated to the replica being queried.
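Here is a minimal sketch of that idea. TraceRegistry, WRITE_SAMPLE_RATE, and TRACE_WINDOW_SECONDS are illustrative names rather than TAO internals, and a real multi-replica system would need to share the trace decision (for example, by propagating a trace flag on the write and its downstream messages) instead of keeping it in one process.

```python
import random
import time

# Sketch of write-sampled tracing: sample a small fraction of writes, then
# trace every operation on that key for a fixed window afterwards.

WRITE_SAMPLE_RATE = 0.0001     # fraction of writes that start a trace (illustrative)
TRACE_WINDOW_SECONDS = 60      # how long a key stays traced after a sampled write


class TraceRegistry:
    """Tracks keys whose recent writes were sampled for tracing."""

    def __init__(self):
        self._traced_until = {}  # key -> timestamp after which tracing stops

    def maybe_trace_write(self, key):
        """Write path: sampling decides whether this write starts a trace."""
        if random.random() < WRITE_SAMPLE_RATE:
            self._traced_until[key] = time.time() + TRACE_WINDOW_SECONDS
            return True
        return False

    def should_trace(self, key):
        """Called for any RPC touching the key: invalidations, replication
        messages, cache fills on reads, at-or-after versioned reads."""
        deadline = self._traced_until.get(key)
        if deadline is None:
            return False
        if time.time() > deadline:
            del self._traced_until[key]
            return False
        return True


# Usage: if the write is sampled, every subsequent operation on the key is
# traced for the next minute, even on replicas the write hasn't reached yet.
registry = TraceRegistry()
if registry.maybe_trace_write("user:42:name"):
    assert registry.should_trace("user:42:name")
```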

While our consistency checker is external to TAO, the actual tracing implementation is integrated directly into the binary. Because of this, we can obtain detailed traces of exactly which code paths executed during a request. We can even perform basic consistency checking while tracing the request execution to verify invariants, such as a new cached version being higher than the old cached version. We’ve been able to use this distributed trace information to pinpoint bugs down to the exact line of code.
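As a rough illustration of that kind of in-flight check, the sketch below has a traced cache fill verify version monotonicity at the moment the cache changes. The Trace class and its fields are hypothetical; the sketch only records the violation rather than blocking the fill.

```python
# Hypothetical sketch of an invariant check performed inside a traced cache fill.

class Trace:
    """Collects events and invariant violations for one traced request."""

    def __init__(self):
        self.violations = []

    def record_violation(self, **fields):
        self.violations.append(fields)


def apply_cache_fill(cache, key, new_version, new_value, trace=None):
    """Apply a fill to a simple dict-based cache of key -> (version, value)."""
    old = cache.get(key)
    if trace is not None and old is not None:
        old_version, _ = old
        if new_version < old_version:
            # A fill should never roll the cached version backwards.
            trace.record_violation(
                key=key,
                old_version=old_version,
                new_version=new_version,
                note="cache fill would roll back the cached version",
            )
    cache[key] = (new_version, new_value)


# Usage: a stale fill arriving after a newer version was cached gets flagged.
cache, trace = {}, Trace()
apply_cache_fill(cache, "user:42:name", new_version=7, new_value="Ada", trace=trace)
apply_cache_fill(cache, "user:42:name", new_version=5, new_value="Ada?", trace=trace)
print(trace.violations)  # one violation: version 7 -> 5
```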

We’ve also used distributed consistency tracing extensively to verify asynchronously updated global secondary indexes. Secondary indexes pose additional challenges for inconsistency detection because index updates are filtered, transformed, resharded, and more. This makes it impossible to verify them perfectly with a program: any consistency checker must also filter, transform, and reshard index updates from the canonical source, and if we had perfect code that could decide whether a row should be in the index, we could just use that as the indexing code and we’d never have inconsistencies in the first place. Further, our secondary indexes are asynchronously updated and don’t support point-in-time reads, making it impossible to directly compare the index to the canonical source.

So in addition to building consistency checkers for our secondary indexes (checkers that are inherently incomplete), we rely on distributed consistency tracing. The observability we get from tracing lets a human verify the accuracy of the secondary indexes when debugging production issues, much the same way that MySQL’s extensive integration testing validates its secondary indexes despite the impossibility of perfect consistency checking.

Consistency is a form of reliability

Consistency and reliability are usually considered to be in tension because of the CAP theorem. And when working on a distributed system, it’s natural to focus on improving service reliability. Sometimes we can take shortcuts that increase reliability at the expense of returning stale results more often.

But consistency shouldn’t be seen as opposing reliability. In fact, I think they’re very similar. If your service has 6 9s of reliability (number of successful requests / number of total requests) but only 3 9s of consistency (number of consistent requests / number of successful requests), do you really have 6 9s of reliability? After all, you’re getting some of those extra 9s just by returning a wrong answer.

With my definition of consistency (the fraction of successful requests that are consistent), it’s arguably better to prioritize adding more 9s of consistency than more 9s of reliability.
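A quick back-of-the-envelope calculation makes the point with the illustrative numbers above: the rate of correct answers ends up capped by consistency, not reliability.

```python
import math

reliability = 0.999999  # 6 9s: successful requests / total requests
consistency = 0.999     # 3 9s: consistent requests / successful requests

# Fraction of all requests that return a successful AND consistent answer.
correct = reliability * consistency
print(f"correct-answer rate: {correct:.6f}")  # ~0.998999

# Converted back to "number of 9s": about 3, dominated entirely by consistency.
print(f"nines of correctness: {-math.log10(1 - correct):.1f}")
```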

New blog to capture my distributed systems learnings

My coworker Lu Pan has a great blog where he writes about his systems and programming learnings. Many times, after we’ve debugged a complex issue or had an interesting theoretical discussion, I check his blog and find a post that distills it all down to its key takeaways.

I’m probably not as good as Lu at writing things so succinctly, but his blog inspired me to try it.