Deterministic simulation testing is fast becoming the gold standard for how mission-critical software is tested. The approach was first popularized by the FoundationDB team, who spent 18 months building a deterministic simulation framework for their database before ever letting it write or read data from an actual physical disk. The results speak for themselves: FoundationDB is widely considered to be one of the most robust and well-tested distributed databases, so much so that Kyle Kingsbury (of Jepsen fame) refused to test it because its deterministic simulator already stress-tested FoundationDB more than the Jepsen framework ever could.
The WarpStream team utilized FoundationDB heavily at Datadog when we built Husky, Datadog’s columnar storage engine for event data. Over the course of our careers, our team has operated (and been on-call for) almost every database on the market: M3DB, etcd, ZooKeeper, Cassandra, Elasticsearch, Redis, MongoDB, MySQL, Postgres, Apache Kafka and more. In our experience, FoundationDB stands in a league of its own in terms of correctness and reliability due to its early investment in deterministic simulation testing.
A more recent example of a database system that leverages this approach is TigerBeetle, a financial transactions database that uses deterministic simulation testing to build one of the most robust financial OLTP databases available today.
When we were designing WarpStream, we knew that it wouldn’t be enough to just replace Apache Kafka with something cheaper and easier to operate. Kafka is the beating heart of many companies’ most critical infrastructure, and if we were to stand any chance of convincing those organizations to adopt WarpStream, we’d have to compress 12+ years of production hardening into a much shorter time frame. We accelerated this process with our architectural decision to rely on object storage as the only storage in the system, bypassing many of the tricky problems of ensuring data durability, availability, and replication at scale. Still, the fact that WarpStream leverages object storage is only a small part of ensuring the correctness of the overall system.
When we first heard about Antithesis, we could hardly contain our excitement. Antithesis, created by the same people who made FoundationDB, has built the holy grail for testing distributed systems: a bespoke hypervisor that deterministically simulates an entire set of Docker containers while injecting faults. For a group of gray-haired distributed systems engineers, seeing Antithesis in action felt like a tribe of cavemen stumbling upon a post-industrial revolution society. As we spoke more with the Antithesis team, an idea began to crystallize: we could use Antithesis to deterministically simulate not only WarpStream, but our entire SaaS!
WarpStream was built differently from most traditional database products: it was designed from day one with a true data plane / control plane split. There are two primary components to WarpStream. First, the Agents (the data plane) act as “thick proxies” that expose the Kafka protocol to clients. The Agents also take care of all communication with object storage, layering in batching and caching to improve performance and keep costs low.
Second is the WarpStream control plane, which has two major components:
The metadata store only has two dependencies:
The SaaS software adds one additional dependency: a traditional SQL database for managing users, organizations, API keys, etc. Looking at WarpStream’s minimal dependencies, we thought: why not test its entire customer experience, from initial signup to running Kafka workloads?
We created a docker-compose file that contains the following components:
With the help of the Antithesis team, we wrote a test workload that started all of those services, signed up for a WarpStream account, created a virtual cluster, and then began producing and consuming data. The workload was carefully structured so that we could assert on a variety of different important properties that WarpStream must maintain at all times.
The test workload consists of multiple producers that are each assigned a unique ID and write records to a small set of topics. Each producer synchronously writes small JSON records that contain its producer ID, a counter (a monotonic sequence number for that producer), and a few other properties. We repeat the same fields in the record’s key, value, and a header to ensure we never accidentally shuffle them around. The consumer side of the workload polls all the topics and all the partitions and asserts that:
The consumers store all of the records for each polling iteration and can assert that a record at offset X in the previous poll still exists in a future poll. This ensures that WarpStream doesn't lose or reorder data as e.g. background compaction runs to reorganize the cluster’s data for more efficient access.
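To make the shape of the workload concrete, here is a minimal, hypothetical Go sketch of the record format and one of the consumer-side checks. This is not WarpStream’s actual test harness, and it assumes (for simplicity) that each producer’s counter sequence lands on a single partition:

```go
package workload

import (
	"encoding/json"
	"fmt"
)

// workloadRecord is the hypothetical payload each producer writes. The same
// fields are duplicated into the record's key, value, and a header so the
// test can detect if any of them are ever shuffled or dropped.
type workloadRecord struct {
	ProducerID string `json:"producer_id"`
	Counter    int64  `json:"counter"`
}

// checkPartition verifies that, within one partition, every producer's
// counter advances by exactly one: a gap suggests lost data, a decrease
// suggests reordering, and a repeat suggests duplication.
func checkPartition(values [][]byte) error {
	lastSeen := map[string]int64{}
	for _, v := range values {
		var r workloadRecord
		if err := json.Unmarshal(v, &r); err != nil {
			return err
		}
		if prev, ok := lastSeen[r.ProducerID]; ok && r.Counter != prev+1 {
			return fmt.Errorf("producer %s: expected counter %d, saw %d (lost, duplicated, or reordered data?)",
				r.ProducerID, prev+1, r.Counter)
		}
		lastSeen[r.ProducerID] = r.Counter
	}
	return nil
}
```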
These assertions address many of the classes of bugs found in previous Jepsen tests of Apache Kafka and other Kafka-compatible systems. For example, prior Jepsen tests have caught bugs like:
At this point you might be scratching your head a little and wondering: “What’s the big deal here? Isn’t this just a really fancy integration test?” Yes and no. Before we started using Antithesis, WarpStream already had a pretty robust set of stress tests that we called the “correctness tests”.
These tests do essentially everything we just described, but in a regular CI environment. Our correctness tests even inject faults all over the WarpStream stack using a custom chaos injection library that we wrote. These tests are incredibly powerful, and they caught a lot of bugs. We would go so far as to say that investing deeply in those correctness tests is one of the main reasons we were able to develop WarpStream as efficiently as we did.
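To give a flavor of what that kind of fault injection looks like, here is a small, hypothetical Go sketch of a chaos wrapper around an object storage client. The names and interface are illustrative only; WarpStream’s actual chaos injection library is more sophisticated:

```go
package chaos

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// ObjectStore is a stand-in for the object storage client used by the
// Agents; the real interface has many more methods.
type ObjectStore interface {
	Put(ctx context.Context, key string, data []byte) error
}

// Chaotic wraps an ObjectStore and randomly injects latency and errors so
// that tests exercise the retry and error-handling paths.
type Chaotic struct {
	Inner       ObjectStore
	FailureRate float64       // probability of returning an injected error
	MaxDelay    time.Duration // upper bound on injected latency
}

func (c *Chaotic) Put(ctx context.Context, key string, data []byte) error {
	if c.MaxDelay > 0 {
		time.Sleep(time.Duration(rand.Int63n(int64(c.MaxDelay))))
	}
	if rand.Float64() < c.FailureRate {
		return errors.New("chaos: injected PUT failure")
	}
	return c.Inner.Put(ctx, key, data)
}
```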
Just like our existing correctness tests, the Antithesis hypervisor automatically injects faults, latency, thread hangs, and restarts into the workload. However, unlike our correctness tests, the Antithesis hypervisor fuzzes the system under test in an intelligent, guided way.
Antithesis automatically instruments your software to measure code coverage and build statistics about the execution frequency of each code path. This enables Antithesis to detect “interesting” behavior in the test (such as infrequent code paths getting exercised, or rare log messages being emitted).
When Antithesis detects interesting or rare behavior, it immediately snapshots the state of the entire system before exploring various different execution branches. This means that Antithesis is much better at triggering rare or unlikely behavior in WarpStream than our existing correctness tests were.
Also, since Antithesis runs the entire software stack in a deterministic simulator, it can actually run the simulation faster than wall clock time. Similar to FoundationDB, WarpStream makes heavy use of timers and batching to improve performance. Anytime a WarpStream Goroutine does the equivalent of time.Sleep(), the Antithesis hypervisor doesn’t actually have to wait. On top of that, the Antithesis hypervisor explores code branches concurrently. All of this adds up in a meaningful way, such that Antithesis can cost-effectively compress years of stress testing into a much shorter time frame.
It’s hard to overstate just how transformative this technology is for building distributed systems. For all intents and purposes, it really does feel like a time traveler arrived from 20 years in the future and handed us their state-of-the-art software testing technology. Of course, it’s not actually magic. Antithesis is the result of dozens of the smartest software engineers, statisticians, and machine learning experts pouring their hearts and souls into the problem of software testing for five years straight. But to us mere mortals, it does feel a lot like magic.
Let’s look at a few example runs that Antithesis generated for us.
Antithesis ran the WarpStream workload for 6 wall clock hours, during which it simulated 280 hours of application time. The graph shows that it took about 160 “application hours” for Antithesis to “stall” and stop discovering new “behaviors” in the WarpStream workload. This means that running the tests for longer than 160 hours has diminishing returns, and instead we should invest in making the test itself more sophisticated if we want to exercise the codebase more. Great feedback for us!
But think about that for a moment: even after 140 hours of injecting faults, randomizing thread execution, and automatically detecting that something interesting or rare had happened and intentionally branching to investigate further, Antithesis was still “discovering” new behaviors in WarpStream. We could hire 100 distributed systems engineers and have them write integration tests for an entire year, and they probably wouldn’t be able to trigger all the interesting states and behaviors that a single Antithesis run covered in 6 hours of wall clock time.
As just one example of how powerful this is, on the first day we started using Antithesis it caught a data race in our metrics instrumentation library that had been present since the first month of the project.
Our correctness tests had run in our regular CI workflows for literally tens of thousands of hours by then, with the Go race detector enabled, and never once caught this bug. Antithesis caught it in its first 233 seconds of execution.
A data race in the instrumentation library isn’t that exciting, though. What about an extremely rare data loss bug that is the result of both a network failure and a race condition? That’s more exciting!
To minimize the number of S3 PUTs that WarpStream users have to pay for, the Agents buffer Kafka Produce requests from many different clients in memory for ~250ms before combining the batches of data into a single file and flushing it to object storage.
In some scenarios, like if write throughput is high, there will be multiple outstanding files being flushed to object storage concurrently. Once flushing the files succeeds, committing the metadata for the flushed files to the control plane can be batched to reduce networking overhead. This is implemented using a background Goroutine that periodically scans the list of “flushed but not yet committed” files.
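The sketch below shows, in simplified and hypothetical form, what this flush-then-commit pattern can look like; the type and method names are illustrative and are not WarpStream’s actual implementation:

```go
package agent

import (
	"context"
	"sync"
	"time"
)

// fileState tracks where a buffered file is in its flush/commit lifecycle.
type fileState int

const (
	fileFlushing fileState = iota // upload to object storage in progress
	fileFlushed                   // upload succeeded, metadata not yet committed
	fileFailed                    // upload failed, must never be committed
)

type pendingFile struct {
	mu    sync.Mutex
	state fileState
	key   string // object storage key of the flushed file
}

type agent struct {
	mu      sync.Mutex
	pending []*pendingFile
	// commit performs a single batched metadata RPC to the control plane.
	commit func(ctx context.Context, files []*pendingFile) error
}

// commitLoop runs as a background Goroutine: every few milliseconds it scans
// the list of "flushed but not yet committed" files and commits their
// metadata to the control plane in one batched call.
func (a *agent) commitLoop(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			var ready []*pendingFile
			a.mu.Lock()
			for _, f := range a.pending {
				f.mu.Lock()
				if f.state == fileFlushed {
					ready = append(ready, f)
				}
				f.mu.Unlock()
			}
			a.mu.Unlock()
			if len(ready) > 0 {
				_ = a.commit(ctx, ready)
			}
		}
	}
}
```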
While refactoring the Agent to add speculative retries for flushing files to object storage, we subtly broke the error handling on this path so that, for a very brief window of time, a file that failed to flush would be considered successful and ready to commit to the control plane metadata store. In program order (i.e., the linear flow of the code, ignoring concurrency), that window would be nearly impossible to hit: the background Goroutine that commits metadata only polls for successful files every five milliseconds, and in the common case the time between the two state transitions would be less than a microsecond!
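Continuing the hypothetical sketch above, here is one way such a window can arise, along with the shape of the fix; again, this illustrates the class of bug rather than WarpStream’s actual code:

```go
// BUGGY: the file is marked flushed before the flush error is inspected, so
// the commit loop's 5ms poll can briefly observe a failed file as committable.
func (f *pendingFile) onFlushDone(err error) {
	f.mu.Lock()
	f.state = fileFlushed // state set before the error is checked
	f.mu.Unlock()

	// <-- if the commit loop polls exactly here, a file that never made it to
	//     object storage gets committed to the control plane metadata store

	if err != nil {
		f.mu.Lock()
		f.state = fileFailed // too late: the metadata may already be committed
		f.mu.Unlock()
	}
}

// FIXED: only transition to fileFlushed after the error has been inspected,
// so the invalid intermediate state can never be observed by the commit loop.
func (f *pendingFile) onFlushDoneFixed(err error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if err != nil {
		f.state = fileFailed
		return
	}
	f.state = fileFlushed
}
```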
This bug is the manifestation of two unlikely events: a file failing to flush and a specific thread interleaving that should be extremely rare in practice. Despite how unlikely these events are to occur together, on a long enough time-scale, this bug would have resulted in data loss at some point.
Instead, thanks to Antithesis’ powerful fuzzer and fault injector, this rare combination of events happened roughly once per wall clock hour of testing. We had been running a build with this bug in our staging environment and never encountered it, let alone once per hour; if we had, it would have been noticed immediately when a subsequent background compaction failed due to the missing file in object storage. We’ve since fixed the regression so that the invalid, temporary state transition cannot occur.
The obvious question you might be asking yourself at this point is: Why use Antithesis instead of a traditional Jepsen test? It’s a good question, and one we asked ourselves before embarking on our journey with Antithesis.
We’re big fans of Jepsen and have consumed almost every published report. However, after speaking with the Antithesis team and spending a few months integrating with it, we feel strongly that deterministic simulation testing with tools like Antithesis is a much more robust and sustainable path forward for the industry. Specifically, we think that Antithesis’ approach is better than Jepsen’s for a few reasons:
We’re just getting started with Antithesis! Over the coming months we plan to work with the Antithesis team to expand our testing footprint to cover additional functionality like:
If you’d like to learn more about WarpStream, please contact us, or join our Slack!