Kafka Replication Without the (Offset) Gaps

Back To Blog

RSS

Arjun Nair

Software Engineer

November 13, 2024

kafka-replication-without-the-offset-gaps

HN Disclosure: WarpStream sells a drop-in replacement for Apache Kafka built directly on-top of object storage.

Introducing Orbit

Orbit is a tool which creates identical, inexpensive, scaleable, and secure continuous replicas of Kafka clusters.

It is built into WarpStream and works without any user intervention to create WarpStream replicas of any Apache Kafka-compatible source cluster like open source Apache Kafka, WarpStream, Amazon MSK, etc.

Records copied by Orbit are offset preserving. Every single record will have the same offset in the destination cluster as it had in the source cluster, including any offset gaps. This feature ensures that your Kafka consumers can be migrated transparently from a source cluster to WarpStream, even if they don’t store their offsets using the Kafka consumer group protocol.

Why Did We Build Orbit?

There are existing tools in the Kafka ecosystem for replication, specifically MirrorMaker. So why did we build something new?

Orbit solves two big problems that MirrorMaker doesn’t – it creates perfect replicas of source Kafka clusters (for disaster recovery, performant tiered storage, additional read replicas, etc.), and also provides an easy migration path from any Kafka-compatible technology to WarpStream.

Offset-Preserving Replication

Existing tools in the ecosystem like MirrorMaker are not offset preserving[1]. Instead, MirrorMaker creates and maintains an offset mapping which is used to translate consumer group offsets from the source cluster to the destination cluster as they’re copied. This offset mapping is imprecise because it is expensive to maintain and cannot be stored for every single record.

MirrorMaker style imprecise offset mappings

Offset mapping and translation in MirrorMaker has two problems:

When a consumer participating in the consumer group protocol is migrated to a destination cluster, it is likely that there is an unfixed amount of duplicate consumption of records as the last offset mapping for the topic partition could be much smaller than the last actually-committed consumer group offset.
MirrorMaker does not perform offset translation for offsets stored outside the consumer group protocol. In practice, a lot of very popular technology that interacts with Apache Kafka (like Flink and Spark Streaming, for example) store their offsets externally and not in Apache Kafka.

This means that tools like MirrorMaker can’t be used to safely migrate every Apache Kafka application from one cluster to another.

Orbit, on the other hand, is offset preserving. That means instead of maintaining an offset mapping between the source and destination cluster, it ensures that every record that is replicated from the source cluster to the destination one maintains its exact offset, including any offset gaps. It’s not possible to do this using the standard Apache Kafka protocol, but since Orbit is tightly integrated into WarpStream we were able to accomplish it using internal APIs.

Orbit writes records at the same offset as the source

This solves the two problems with MirrorMaker. Since Orbit ensures that the offset of every single record written to the destination has exactly the same offset as the source, consumer group offsets from the source can be copied over without any translation.

Moreover, applications which store offsets outside of the consumer group protocol can still switch consumption from the source cluster to WarpStream seamlessly because the offsets they were tracking outside of Kafka map to the exact same records in WarpStream that they mapped to in the source cluster.

In summary, offset-preserving replication is awesome because it eliminates a huge class of Apache Kafka replication edge cases, so you don’t have to think about them.

Cohesion and Simplicity

Orbit is fully integrated with the rest of WarpStream. It is controlled by a stateless scheduler in the WarpStream control plane which submits jobs which are run in the WarpStream Agents. Just like the rest of WarpStream, the metadata store is considered the source of truth and the Agents are still stateless and easy to scale.

You don’t need to learn how to deploy and monitor another stateful distributed system like MirrorMaker to perform your migration. Just spin up WarpStream Agents, edit the following YAML file in the WarpStream Console, hit save, and watch your data start replicating. It’s that easy.

To make your migrations go faster, just increase the source cluster fetch concurrency from the YAML and spin up more stateless WarpStream Agents if necessary.

Click ops not your cup of tea? You can use our terraform provider or dedicated APIs instead.

The Kafka Protocol is Dark and Full of Terrors

Customers building applications using Kafka shouldn't have to worry that they haven't considered every single replication edge case, so we obsessively thought about correctness and dealt with edge cases that come up during async replication of Kafka clusters.

As a quick example, it is crucial that the committed consumer group offset of a topic partition copied to the destination is within the range of offsets for the topic partition in the destination. Consider the following sequence of events which can come up during async replication:

There exists a topic A with a single partition 0 in the source cluster.
Records in the offset range 0 to 1000 have been copied over to the destination cluster.
A committed consumer group offset of 1005 is copied over to the destination cluster.
A Kafka client tries to read from the committed offset 1005 from the destination cluster.
The destination cluster will return an offset out of range error to the client.
Upon receiving the error, some clients will begin consuming from the beginning of the topic partition by default, which leads to massive duplicate consumption.

To ensure that we catch other correctness issues of this nature, we built a randomized testing framework that writes records, updates the data and metadata in a source cluster, and ensures that Orbit keeps the source and destination perfectly in sync.

As always, we sweat the details so you don’t have to!

Use Cases

Once you have a tool which you can trust to create identical replicas of Kafka clusters for you, and the destination cluster is WarpStream, the following use cases are unlocked:

Migrations

Orbit keeps your source and destination clusters exactly in sync, copying consumer group offsets, topic configurations, cluster configurations, and more. The state in the destination cluster is always kept consistent with the source.

Orbit can, of course, be used to migrate consumers which use the Consumer Group protocol, but since it is offset preserving it can also be used to migrate applications where the Kafka consumer offsets are stored outside of the source Kafka cluster.

Disaster Recovery

Since the source and destination clusters are identical, you can temporarily cut over your consumers to the destination WarpStream cluster if the source cluster is unavailable.

The destination WarpStream cluster can be in another region from your source cluster to achieve multi-region resiliency.

Cost-Effective Read Replicas

Replicating your source clusters into WarpStream is cheaper than replicating into Apache Kafka because WarpStream’s architecture is cheaper to operate:

All the data stored in WarpStream is only stored in object storage, which is 24x cheaper than local disks in the cloud.
WarpStream clusters incur zero inter-zone networking fees, which can be up to 80% of the cost of running a Kafka cluster in the cloud.
WarpStream clusters auto-scale by default because the Agents themselves are completely stateless, so your WarpStream cluster will always be perfectly right-sized.

This means that you can use the WarpStream cluster replica to offload secondary workloads to the WarpStream cluster to provide workload isolation for your primary cluster.

Performant Tiered Storage

We’ve written previously about some of the issues that can arise when bolting tiered storage on after the fact to existing streaming systems, as well as how WarpStream mitigates those issues with its Zero Disk Architecture. One of the benefits of Orbit is that it can be used as a cost effective tiered storage solution that is performant and scalable by increasing the retention of the replicated topics in the WarpStream cluster to be higher than the retention in the source cluster.

Start Migrating Now

Orbit is available for any BYOC WarpStream cluster. You can go here to read the docs to see how to get started with Orbit, learn more via the Orbit product page, or contact us if you have questions. If you don’t have a WarpStream account, you can create a free account. All new accounts come pre-loaded with $400 in credits that never expire and no credit card is required to start.

Notes

[1] While Confluent Cluster Linking is also offset preserving it cannot be used for migrations into WarpStream.

Get started with WarpStream today and get $400 in credits that do not expire. No credit card is required to start.