Public Benchmarks and TCO Analysis

Mar 5, 2024
Richard Artoul

Introduction

Benchmarking databases – and maintaining fairness and integrity while doing so – is a notoriously difficult task to get right, especially in the data streaming space. Vendors want their systems to produce mouth-watering results, so unnatural configurations divorced from customer realities (AKA "vanity" benchmarks) get tested, and it's ultimately the end user who is left holding the bag when they realize that their actual TCO is a lot higher than they were led to believe.

With all of that in mind, when we set out to publish WarpStream’s initial public benchmarks, we decided that we would do a few things differently than everybody else.

  1. We didn't try to benchmark another vendor's implementation. Instead, we focused exclusively on measuring WarpStream's capabilities and highlighting where it shines and where it falls short. That said, we did use the standardized OpenMessaging benchmark framework so that readers can compare our results to other vendors' public benchmarks.
  2. We focused on cost and explicitly compared WarpStream’s TCO to the TCO of self-hosted Apache Kafka (but not any particular vendor’s pricing). Ultimately, database benchmarks are all about costs. When a vendor tells you: “our system was able to run this benchmark with only 3 nodes, while the other system required 9”, what they’re actually trying to convey is that their system is cheaper and easier to run than the alternatives. For the same reason, we also included some light commentary (with evidence) about how easy WarpStream is to operate and scale.
  3. We tested more difficult workloads than the standard "vanity" benchmark you're used to seeing. In addition to the standard vanity benchmark that is used in almost every other data streaming benchmark (1 GiB/s writes, 1 GiB/s reads, 4 producers, 4 consumers, 1 topic, 288 partitions), we also provide benchmark results and TCO figures for a variety of significantly more difficult workloads that are designed to mirror the messy and unpredictable reality of our customers' production environments. Don't get us wrong here, WarpStream performed exceptionally well in the vanity benchmark, and it would've made our lives way easier if we had only published those results. However, we couldn't in good conscience use that workload as an indication to our customers of how WarpStream would perform in their production environments.

Production is a messy place, and therefore it deserves messy benchmarks. As a result, you’re going to see a few things in this benchmark post that you’re probably not used to seeing. In addition to the standard vanity benchmark, we also:

  1. Scaled the cluster up and down while the benchmark was running
  2. Increased the consumer fan out from 1x to 3x, tripling the amount of reads that WarpStream has to serve
  3. Increased the partition count 16x from 288 to 4,608
  4. Increased the number of producer clients from 4 to 256
  5. Increased the number of consumer clients from 4 to 192
  6. Switched from the standard "sticky" partitioner to the much more difficult "round-robin" partitioner

WarpStream handled the vanity benchmark, plus all of the more difficult scenarios above, with just 6 m6in.4xl instances and, more importantly, at more than 4x lower TCO than self-hosting Apache Kafka (including WarpStream's control plane / license fee). Of course, as they say in software, there's no free lunch. WarpStream did exhibit higher latency than the equivalent Apache Kafka setup would have (owing to its object store based design), but we think that's the right trade-off for the vast majority of workloads.

We’re really excited about how well WarpStream performed in all these tests, and we hope that you’ll dive into all the gory details with us in the rest of the post!

The Vanity Benchmark

The standard Apache Kafka benchmark involves writing 1 million 1 KiB messages/s to the underlying system with the OpenMessaging benchmark framework, spreading the traffic over 288 partitions in a single topic. Traffic is driven by 4 producer clients and 4 consumer clients, and consumer fan out is 1x.

This workload is expressed using the following YAML:

name: vanity
topics: 1
partitionsPerTopic: 288
messageSize: 1024
useRandomizedPayloads: true
randomBytesRatio: 0.5
randomizedPayloadPoolSize: 1000
subscriptionsPerTopic: 1
consumerPerSubscription: 64
producersPerTopic: 64
producerRate: 1000000
consumerBacklogSizeGB: 0
testDurationMinutes: 1600

The value for randomBytesRatio varies from benchmark to benchmark, but we selected 0.5 as it results in the lowest compression ratio and therefore the most difficult benchmark. We also increased the base number of producers and consumers from 4 to 64, which makes the workload harder still: it reduces the effectiveness of batching and forces the system to handle more batches, requests, and connections. We felt this was the bare minimum to make the results of the vanity benchmark interesting.
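To give a rough sense of why a randomBytesRatio of 0.5 caps compression at roughly 2x, here is a small illustrative sketch (not the exact OpenMessaging payload generator) that compresses a 1 KiB payload where half the bytes are random and half are trivially compressible, using the python-lz4 bindings:

import os
import lz4.frame

# Hypothetical stand-in for a payload with randomBytesRatio=0.5:
# half random (incompressible) bytes, half repeated (highly compressible) bytes.
payload = os.urandom(512) + b"\x00" * 512

compressed = lz4.frame.compress(payload)
print(f"original: {len(payload)} bytes, compressed: {len(compressed)} bytes")
print(f"compression ratio: {len(payload) / len(compressed):.2f}x")  # close to 2x at best

The random half cannot be compressed at all, so no matter how well the other half compresses, the overall ratio stays near 2x.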

We also configured the benchmark framework with the following driver configuration:

producerConfig: |
  linger.ms=25
  batch.size=100000
  buffer.memory=128000000
  max.request.size=64000000
  compression.type=lz4
consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=true
  max.partition.fetch.bytes=100485760
  fetch.max.bytes=100485760

Finally, we used m6in.4xl instances to run the WarpStream Agents and m6in.8xl instances to run the OpenMessaging benchmark clients (intentionally overprovisioned to avoid any bottlenecks in the clients). We performed zero operating system, kernel, or EC2 tuning. We used stock, pre-made AWS ECS images and ran the WarpStream Agents with the default configuration.

WarpStream handles this workload without breaking a sweat, transmitting 1 GiB/s of symmetric write/read throughput using just 6 m6in.4xl instances.

Write throughput, rolled up to 20s intervals.

The instances were only lightly loaded, using ~5 out of their 16 available cores at any given moment.

They also only used a small fraction of their 64 GiB of RAM.

Of course, it’s not all sunshine and roses. WarpStream does make one trade-off that most other data streaming systems don’t: higher latency. As expected, WarpStream exhibited higher latency in the benchmark than comparable systems like Apache Kafka.

WarpStream intentionally makes this trade-off for a very specific reason: to reduce cloud infrastructure costs and operations by almost an order of magnitude. We think this is the right trade-off for the vast majority of real Kafka workloads. In actuality, WarpStream can achieve much lower publisher and end-to-end latency if it's configured for lower latency (which results in higher costs). However, we wanted to first demonstrate the kind of latency users can expect with WarpStream's default configuration, which optimizes for low costs.

What does a little latency buy you?

Elasticity and Operational Simplicity

We just made a bold claim: WarpStream reduces operational burden by an order of magnitude compared to other Kafka implementations. Let’s prove it!

One of the most common Kafka cluster operations users carry out is scaling the cluster in and out. For fun, we tried scaling the number of WarpStream Agents down from 6 nodes to 3 nodes while the benchmark was still running. It held up just fine.

Downscaling took about a minute, and all we had to do was update the replica count in our ECS service from 6 to 3.

That’s it. That’s all we did. No 3rd party solution, special operations, custom scripts, or anything else. We just asked ECS to reduce the replica count, and then waited. If we were using Kubernetes, we would have run a command like:

kubectl scale deployment warpstream-benchmark --replicas 3
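For reference, the same downscale on ECS is a single API call. Here is a minimal sketch using boto3; the cluster and service names are hypothetical placeholders:

import boto3

ecs = boto3.client("ecs")

# Scale the WarpStream Agent service from 6 replicas down to 3.
# Cluster and service names here are placeholders, not the exact ones we used.
ecs.update_service(
    cluster="warpstream-benchmark",
    service="warpstream-benchmark",
    desiredCount=3,
)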

No partition rebalancing, no waiting for data to be replicated. Just pure stateless scaling like a traditional web application. In fact, if you look closely at the graphs, you’ll see that our EC2 auto-scaling group actually decided to rotate one of the three remaining nodes while the downscale was happening. For a period of time, we were running the entire workload on just 2 nodes!

Of course, latency was higher with that much load on the nodes.

Once we scaled the Agents back to 6 replicas in ECS, the system redistributed the load automatically.

4x Lower TCO

Now, let's talk about costs.

The nodes in this benchmark don't have any attached SSDs, and we didn't provision any EBS volumes. The entire cluster is truly stateless (because WarpStream uses object storage as the primary and only storage layer), and as a result we're able to use relatively small and cost-effective instances. Compared to the equivalent instance types with local disks or EBS volumes that Apache Kafka would require to support this workload, the WarpStream hardware costs are quite minimal. Specifically, 6 m6in.4xl instances cost 6 * $0.7017/hr * 24 * 30 == $3,031/month. That's a worst-case estimate, too, as it assumes more than 50% overprovisioning. In reality, we could achieve much lower costs by leveraging WarpStream's ability to auto-scale "just in time" like a traditional web server. Note that this value was calculated using AWS 1-year no-upfront reserved pricing, which is the same publicly discounted rate that we'll use for the comparison with Apache Kafka as well.

However, there are a few other costs to consider. The most significant ones are the S3 object storage requests.

Total S3 GET costs: $0.0000004/GET * 490 * 60 * 60 * 24 * 30 = $508/month

Total S3 PUT costs: $0.000005/PUT * 138 * 60 * 60 * 24 * 30 = $1,788/month

Total cloud infrastructure costs: $5,327/month

WarpStream also charges a fee to use our cloud control plane. For 1GiB/s of throughput, our public pricing indicates we’d charge $7,541/month. Therefore the total cost of ownership for this workload would be: $12,868/month.
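For readers who want to check our math, here is a small sketch that reproduces the WarpStream TCO arithmetic above, using the rates quoted in this post:

# Reproduces the WarpStream TCO arithmetic from this section.
HOURS_PER_MONTH = 24 * 30
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

ec2 = 6 * 0.7017 * HOURS_PER_MONTH               # 6 m6in.4xl @ 1-yr reserved pricing
s3_gets = 0.0000004 * 490 * SECONDS_PER_MONTH    # ~490 GETs/s
s3_puts = 0.000005 * 138 * SECONDS_PER_MONTH     # ~138 PUTs/s
infra = ec2 + s3_gets + s3_puts                  # ~$5,327/month
control_plane = 7_541                            # WarpStream control plane fee
total = infra + control_plane                    # ~$12,868/month

print(f"EC2: ${ec2:,.0f}, S3 GET: ${s3_gets:,.0f}, S3 PUT: ${s3_puts:,.0f}")
print(f"Infra: ${infra:,.0f}, Total TCO: ${total:,.0f}")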

More important than what you are paying for, though, is what you’re not paying for: inter-zone networking. WarpStream’s unique architecture ensures that even for a workload sustaining 1GiB/s of symmetric writes and reads, inter-zone networking fees are ~$0.

WarpStream accomplishes this by ensuring that producers are automatically aligned with Agents running in the same availability zone as them. This is possible because WarpStream has no topic-partition leaders, and any Agent can write data for any topic-partition. Consumers are also always zonally aligned with an Agent running in the same availability zone as them, similar to how Apache Kafka leverages the follower-fetch functionality to achieve the same effect.

Finally, there is no inter-AZ networking to replicate data between the Agents because WarpStream uses object storage as both the storage layer and the inter-AZ network, and networking between VMs and object storage is free in all cloud environments.

Apache Kafka and other similar systems can avoid inter-AZ networking between consumers and brokers using the follower-fetch functionality, but it’s impossible for them to avoid the inter-AZ networking fees between producers and brokers, as well as the inter-AZ fees for replicating the data between the brokers.
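For context, Kafka's follower fetching (KIP-392) is what makes the consumer side of that comparison possible: brokers need a rack-aware replica selector, and each consumer advertises its zone via client.rack. Below is a minimal sketch of the consumer side using the confluent-kafka Python client; the broker address, group name, and topic are placeholders:

from confluent_kafka import Consumer

# Broker-side prerequisite (server.properties, shown here as comments):
#   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
#   broker.rack=us-east-1a   (set per broker to its own AZ)

consumer = Consumer({
    "bootstrap.servers": "kafka-1:9092",   # placeholder
    "group.id": "fanout-consumer",         # placeholder
    "client.rack": "us-east-1a",           # the consumer's own AZ, so fetches go to a local replica
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])      # placeholder topic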

Diagram showing inter-zone networking fees incurred by Apache Kafka between producers and topic-partition leader, as well as between topic-partition leader and followers for replication.

WarpStream bypasses these networking fees entirely by using object storage as the storage layer and the network layer. Of course, instead of inter-zone networking fees WarpStream incurs object storage API fees, but as you'll see in the next section, those fees pale in comparison to the inter-zone networking fees.

With WarpStream, consumers and producers are zonally aligned. In addition, replication between zones is handled by the object storage layer which avoids inter-zone networking fees.

To make this concrete, let’s figure out how much traffic is being transmitted from the OpenMessaging producers to the WarpStream Agents.

490 MiB/s of LZ4-compressed throughput. This makes sense, because we intentionally configured our benchmark so that the data will only compress at most 2x to make the workload harder on the WarpStream Agents. This is also what we observed when we reviewed other vendors' vanity benchmarks.

However, let’s be really generous and assume the compressed throughput in reality is only 200 MiB/s because the data is highly compressible and our producer clients are well tuned such that we achieve 5x compression over the wire.

Inter-zone networking costs $0.02/GiB ($0.01/GiB in the egress zone, and $0.01/GiB in the ingress zone). In addition, assuming that writes are evenly distributed, 2/3rds of the time the Apache Kafka partition leader will be in a different availability zone than the producer client. Therefore, just transmitting data from the producer client to the topic-partition leader will cost ((200/1000) * 0.02 * 2/3) * 60 * 60 * 24 * 30 == $6,912/month in inter-AZ networking fees alone if we're running Apache Kafka.

Of course, the produced data also has to be replicated to the two remaining availability zones before it can be considered durable. As a result, Apache Kafka would incur another ((200/1000) * 0.02 * 2) * 60 * 60 * 24 * 30 == $20,736/month in inter-AZ replication fees. With WarpStream, this number is ~$0.

This means that the equivalent Apache Kafka setup would cost $27,648/month in inter-AZ networking fees alone in the best-case scenario where follower-fetch has been properly configured. In reality, once we add in EC2 and EBS fees, the actual cost is closer to $54,591/month all-in for only 3 days of retention, and that's not including any fees to a vendor to manage all of this for you or provide support. Increasing the retention from 3 days to 7 would skyrocket the costs even further to $94,134/month, but we'll set that aside for now.
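And here is the corresponding sketch for the Apache Kafka inter-AZ math, using the same generous 200 MiB/s compressed-throughput assumption from above:

# Inter-AZ fees for the equivalent Apache Kafka setup, per the assumptions above.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
COMPRESSED_GIB_PER_SEC = 200 / 1000          # generous 5x compression assumption
INTER_AZ_RATE = 0.02                         # $/GiB ($0.01 egress + $0.01 ingress)

# 2/3 of producer traffic crosses zones to reach the topic-partition leader.
producer_to_leader = COMPRESSED_GIB_PER_SEC * INTER_AZ_RATE * (2 / 3) * SECONDS_PER_MONTH
# Every byte is replicated to the two other zones before it is considered durable.
replication = COMPRESSED_GIB_PER_SEC * INTER_AZ_RATE * 2 * SECONDS_PER_MONTH

print(f"producer -> leader: ${producer_to_leader:,.0f}/month")                 # ~$6,912
print(f"replication:        ${replication:,.0f}/month")                        # ~$20,736
print(f"total inter-AZ:     ${producer_to_leader + replication:,.0f}/month")   # ~$27,648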

Let's put everything in perspective now. Self-hosting Apache Kafka will cost $54,591/month in cloud infrastructure costs alone. WarpStream will cost $5,327/month in cloud infrastructure costs, more than 10x cheaper, and even after we include the WarpStream license / control plane fees, it only costs $12,868/month, which is still more than 4x cheaper for a solution that requires virtually no management in the first place.

All of us at WarpStream have spent most of our careers working closely with organizations running some of the largest Kafka deployments in the world, and for all but the lowest throughput workloads, inter-AZ networking usually makes up the bulk of the actual total cost - even taking into consideration the substantial discounts provided by the cloud providers!

This is the dirty secret of the data streaming industry, and an uncomfortable truth that many vendors will go out of their way to obfuscate. Unfortunately, most of the time they get away with it because attributing network costs to specific services or workloads in your environment is incredibly difficult, so this cost is hidden from you. Next time you're evaluating a TCO comparison, make sure you check whether inter-AZ networking fees are taken into consideration.

In summary, we demonstrated with this benchmark that WarpStream is at least as efficient per core as Apache Kafka, but even if WarpStream required 5x more hardware than Apache Kafka (it doesn’t), it would still be more cost effective than self-hosting Apache Kafka simply by eliminating networking fees even after taking the WarpStream control plane fee into account.

Taking it further

We started this blog post by saying that we don’t think the standard “vanity” benchmark is a good measure of real world workloads. We already made the vanity workload more difficult by increasing the number of producer and consumer clients from 4 to 64. Now, let’s go one step further and increase the consumer fan out from 1x to 2x, and then from 2x to 3x. This simulates a more realistic scenario where Kafka is being used to “fan out” data to multiple different consumers that are each processing the data in different ways.

In total, increasing the consumer fan out from 1x to 3x increased CPU usage by 42% and S3 GET requests by 22%.

Latency increased slightly, but remained well within acceptable levels.

The cluster is now serving 1GiB/s of writes, and 3 GiB/s of reads for a total of 4GiB/s of traffic, still using just 6 m6in.4xls. Due to how the OpenMessaging benchmark framework works, increasing the consumer fan out from 1x to 3x also increased the total number of consumer clients from 64 to 192. However, we also wanted to see how the cluster would tolerate an increase in the number of producer clients, so we increased the number of producers from 64 to 256.

This had minimal impact on CPU and latency.
The number of S3 PUTs was also almost completely unaffected.

Next, we decided to stress the workload even further by increasing partition counts. We quadrupled the number of partitions from 288 to 1,152. This had minimal impact on CPU usage, no impact on S3 PUT or GET operations, and increased end-to-end latency only slightly.

Finally, we decided to quadruple the partition count again from 1,152 to 4,608, and once again the CPU usage remained mostly the same, and the number of S3 PUT and GET requests were completely unchanged. End-to-end latency did increase slightly, but stayed well below our 2s target at the P99.

One final workload

Before we concluded this round of benchmarks, there was one final test we wanted to run. This test represents the most difficult workload for systems like WarpStream and Apache Kafka, and all it involves is a one line change to the workload YAML:

keyDistributor: "KEY_ROUND_ROBIN"

What's so special about this line? Well, the records produced by the OpenMessaging benchmark don't have any keys associated with them. This means that the Java client will use an extremely efficient "sticky" partitioning strategy: it picks a partition at random, adds all produced records to a batch for that partition until the batch reaches the configured batch.size, and then switches to another partition and starts a new batch.

This means that with a batch.size of 131,072 bytes, the 1 GiB/s workload only needs to generate roughly 7,629 batches/s to satisfy the desired throughput. That's just not that many batches.
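As a quick sanity check on that number (approximating 1 GiB/s as 10^9 bytes/s, as the figure above does):

# Batches per second needed with the sticky partitioner, assuming full batches.
throughput_bytes_per_sec = 1_000_000 * 1_000   # ~1 GiB/s, approximated as 10^9 bytes/s
batch_size_bytes = 131_072                     # the Java client's default batch.size

batches_per_sec = throughput_bytes_per_sec / batch_size_bytes
print(f"{batches_per_sec:,.0f} batches/s")     # ~7,629 batches/s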

This makes the Kafka broker's job easy: batches compress well, which keeps network and disk throughput low, and the disk access patterns are well behaved. The brokers are just flushing already well-sized batches to the current mutable segment for each topic-partition. Unlike Apache Kafka, WarpStream doesn't have per topic-partition files. However, the rate of written batches does still impact WarpStream's performance in a few ways:

  1. It increases the amount of load on the metadata store / control plane because more batches have to be tracked and managed.
  2. It increases CPU usage because more batches usually means worse compression, which means more bytes have to be copied around the network.
  3. It increases end-to-end latency, because the Agents have to read more batches and perform more IO to transmit data to consumers.

Now that we understand the implications of the partitioning strategy, let's go back to the choice of the round robin partitioner. This is the absolute worst case scenario for any Kafka implementation because it means that the producer client will round-robin the records amongst all the partitions.

Let's make this concrete with an example. Imagine 10 1 KiB records are produced within a single linger.ms interval. The default partitioner would pick a partition at random and write all of those records into a single batch for that partition, because all 10 KiB worth of records fit within the 131,072-byte limit. However, the round-robin partitioner would spread the records across 10 different partitions, creating 10 different batches, each with one record in it. The throughput is unchanged at 1 GiB/s, but now the system has to process 10x as many batches!
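The same 10-record example, in sketch form:

# 10 records of 1 KiB produced within one linger.ms window, 288 partitions.
records = [b"x" * 1024 for _ in range(10)]

# Sticky partitioner: all 10 records land in one batch for a single random partition.
sticky_batches = 1

# Round-robin partitioner: record i goes to partition i % 288,
# so each record ends up in its own single-record batch.
round_robin_batches = len({i % 288 for i in range(len(records))})

print(sticky_batches, round_robin_batches)  # 1 vs 10 -> 10x as many batches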

Almost no one publishes benchmark results with this workload configuration because it reduces throughput and increases latency. But this workload is really important in the real world. The ability of a Kafka implementation to handle a huge number of batches per second is directly proportional to how easy that system will be to manage, and how well it will hold up in real production workloads where clients care which partitions records are written to, and are optimizing for application level concerns, not microbenchmarks.

name: round_robin
topics: 1
partitionsPerTopic: 288
messageSize: 1024
useRandomizedPayloads: true
randomBytesRatio: 0.5
randomizedPayloadPoolSize: 1000
subscriptionsPerTopic: 1
consumerPerSubscription: 16
producersPerTopic: 64
producerRate: 760000
consumerBacklogSizeGB: 0
testDurationMinutes: 1600
keyDistributor: "KEY_ROUND_ROBIN"

As we said before, this workload is really hard for any data streaming system. So we dialed the partition count and consumer fan out back down to what you're used to seeing in a standard benchmark, but we kept the number of producers and consumers high. We also set the partitioner to the round-robin strategy. Finally, we reduced the producer rate to 750 MiB/s.

To be fair, Apache Kafka can handle this workload as well, especially if fsync is disabled, as is common in most vendor benchmarks. However, we wanted to demonstrate that WarpStream handles this workload just as well. WarpStream was built to handle the messy realities of production, not just theoretical benchmarks.

Conclusion

In summary, WarpStream handles a variety of extremely difficult workloads with ease. While WarpStream does have higher producer and end-to-end latency than other Kafka implementations, it still achieved sub-second P99 producer latency and sub-2s P99 end-to-end latency in (almost) every workload we tested.

WarpStream offers 4-10x lower cloud infrastructure costs by eliminating inter-AZ networking fees entirely and using the lowest cost storage available in cloud environments: commodity object storage. Remember, even with our generous assumption of 5x compression over the wire between producers and brokers, we could have used 5x more hardware with WarpStream and still reduced costs compared to Apache Kafka, not because Apache Kafka is bad, but because inter-AZ networking is just that expensive.

We also hope that this post demonstrates just how low-touch WarpStream is compared to other streaming platforms. WarpStream is completely stateless: no quorums, no Raft leaders, no other state to maintain, and no manual intervention when a node dies and is replaced. In addition, scaling the system up or down is literally as easy as adding or removing containers; no additional operations required.

In the near future we will follow this post with another round of benchmarks using Amazon’s recently launched S3 Express product to demonstrate how WarpStream can trade-off between cost and latency using the same architecture simply by swapping out the underlying storage medium. Stay tuned!

Like what you see? Contact us, or join our community Slack!
