Introducing WarpStream BYOC Schema Registry

Brian Shih

Software Engineer

November 25, 2024

HN Disclosure: WarpStream sells a drop-in replacement for Apache Kafka built directly on top of object storage.

Schema Registry, Redesigned

Our vision at WarpStream is to build a BYOC streaming platform that is secure, simple to operate, and cost-effective. As the first step towards that vision, we built WarpStream BYOC, a reimplementation of the Kafka protocol with a stateless, zero disk architecture (read more about Diskless Kafka) that is purpose-built for the cloud. This greatly reduces the operational burden of running Kafka clusters by replacing the stateful Kafka brokers with stateless WarpStream Agents. However, there’s more to data streaming than just the Kafka clusters themselves.

Many organizations deploy a schema registry alongside their Kafka clusters to help ensure that all of their data uses well-known and shared schemas. Unfortunately, existing schema registry implementations are stateful, distributed systems that are not trivial to operate, especially in a highly available way. When deploying and maintaining them, you may have to worry about leader election, managing disks, and data rebalances.

Alternatively, you can offload the deployment and maintenance of your schema registry to an external, cloud-managed version. There is a lot to be said for offloading your data governance to a third party – you don’t have to deal with deploying or managing any infrastructure, and in Confluent Cloud you can take advantage of features such as Confluent’s Stream Governance. But for some customers, offloading the schemas, which contain the shape of the data, to a third party is not an option. That is one of the reasons why we felt that a stateless, BYOC schema registry was an important piece of WarpStream’s BYOC data streaming puzzle.

We’re excited to announce the release of WarpStream’s BYOC Schema Registry, a schema registry implementation that is API-compatible with Confluent’s Schema Registry, but deployed using WarpStream’s BYOC deployment model and architected with WarpStream’s signature data plane / control plane split. All your schemas sit securely in your own cloud environment and object storage buckets, with WarpStream responsible for scaling the metadata (schema ID assignments, concurrency control, etc.).

In this blog, we will dive deeper into the architecture of WarpStream’s BYOC Schema Registry and explain the design decisions that went into building it.

Architecture Overview

The BYOC Schema Registry comes with all the benefits of WarpStream’s BYOC model and is designed with the following properties:

Zero disk architecture (aka Diskless Kafka)
Separation of storage and compute
Separation of data from metadata
Separation of the data plane from the control plane

The Schema Registry is embedded natively into the stateless Agent binary. To deploy a schema registry cluster, simply deploy the Agent binary into stateless containers and provide the Agent with permissions to communicate with your object storage bucket and WarpStream’s control plane.

*Simplified view of the schemas being stored in object storage and metadata being offloaded to the control plane.*

All schemas live in object storage with no intermediary disks. The only data that leaves your environment is metadata sent to WarpStream’s control plane, such as the schema ID assigned to each schema. Due to the stateless nature of the agents, scaling the schema registry during read spikes is as easy as scaling up stateless web servers.

Everyone Can Write

Kafka’s open-source Schema Registry is designed to be a distributed system with a single-primary architecture, using ZooKeeper or Kafka to elect the primary and a Kafka log for storage. Under this architecture, only the elected leader can act as the “primary” and write to the underlying Kafka log. The leader’s writes are then mirrored to read-only replicas that can serve read requests.

One downside of this architecture is that when the leader is down, the cluster will be unable to serve write requests until a new leader is elected. This is not the case for WarpStream Agents. In WarpStream’s BYOC Schema Registry, no agent is special and any agent can serve both write and read requests. This is because metadata coordination that requires consensus, such as the assignment of globally unique schema IDs to each schema, is offloaded to WarpStream’s highly available and fully managed metadata store.
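Because the registry is API-compatible with Confluent’s Schema Registry, writes and reads use the standard REST endpoints, and either kind of request can go to any agent. The sketch below only builds the two HTTP requests without sending them. The base URL and the subject name are placeholders for illustration, not values taken from WarpStream’s docs.

```python
import json
import urllib.request

# Content type used by the Confluent Schema Registry REST API.
SR_CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def register_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build POST /subjects/{subject}/versions, which registers a schema."""
    # The API expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": SR_CONTENT_TYPE},
        method="POST",
    )


def fetch_by_id_request(base_url: str, schema_id: int) -> urllib.request.Request:
    """Build GET /schemas/ids/{id}, which returns a schema by its global ID."""
    return urllib.request.Request(f"{base_url}/schemas/ids/{schema_id}", method="GET")


# Hypothetical agent address and subject, for illustration only.
BASE_URL = "http://localhost:9094"
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"}],
}
write_req = register_request(BASE_URL, "users-value", user_schema)
read_req = fetch_by_id_request(BASE_URL, 1)
# Sending is left to the caller, e.g. urllib.request.urlopen(write_req).
```

With a single-primary registry the write would have to reach the leader. Here the same two requests can be pointed at whichever agent the client happens to connect to.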

Minimizing Object Storage API Calls

Object storage API calls are both costly and slow. Therefore, one of our design goals is to minimize the number of API calls to object storage. Even though most schema registry clients will cache fetched schemas, we designed WarpStream’s Schema Registry to handle the extreme scenario where thousands of clients restart and query the schema registry at the same time.

Without any caching on the agents, the number of API calls to object storage grows linearly with the number of clients. By caching the schema, each agent will only fetch each schema once, until the cache evicts the schema. However, the number of object storage API calls still grows linearly with the number of agents. This is because it’s not guaranteed that all read requests for a specific schema ID will always go to the same agent. Whether you use WarpStream’s service discovery system (covered in the next section) or your own HTTP load balancer, the traffic will likely be distributed amongst the agents quite evenly, so each agent would still have to fetch from object storage once for each schema. We were not satisfied with this.

Ideally, each schema is downloaded from object storage once and only once per availability zone, across all agents. What we need here is an abstraction that looks like a “distributed mmap” in which each agent is responsible for caching data for a subset of files in the object storage bucket. This way, when an agent receives a read request for a schema ID and the schema is not in the local cache, it will fetch the schema from the agent responsible for caching that schema file instead of from object storage.
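To make the difference between the three approaches concrete, here is a back-of-the-envelope calculation of worst-case object storage GETs when every client reads every schema once after a mass restart. The function name and all input numbers are made up for illustration; they are not measurements.

```python
def worst_case_gets(schemas: int, clients: int, agents: int, zones: int) -> dict[str, int]:
    """Upper bound on object storage GETs when every client reads every schema once."""
    return {
        "no_caching": clients * schemas,       # every request goes to object storage
        "per_agent_cache": agents * schemas,   # each agent fetches each schema once
        "distributed_cache": zones * schemas,  # each zone fetches each schema once
    }


# Illustrative inputs only.
counts = worst_case_gets(schemas=500, clients=5_000, agents=12, zones=3)
```

With these inputs the three strategies come out to 2,500,000, 6,000, and 1,500 GETs respectively, and only the last number stays flat as agents are added.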

Luckily, we already built the “distributed mmap” abstraction for WarpStream! The distributed file cache explained in this blog uses a consistent hash ring to make each agent responsible for caching data for a subset of files. The ID of the file is used as the hash key for the consistent hashing ring.
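As a rough illustration of the idea, the sketch below builds a small consistent hash ring and a read-through cache on top of it. It is not WarpStream’s implementation: the class names, the use of SHA-256, the virtual node count, and the in-memory dictionary standing in for object storage are all assumptions. It also simplifies by caching only on the owning agent, leaving out the receiving agent’s local cache.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, agents, vnodes: int = 64):
        # Each agent gets several virtual nodes to spread ownership evenly.
        self._ring = sorted(
            (_point(f"{agent}#{i}"), agent) for agent in agents for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def owner(self, file_id: str) -> str:
        """Return the agent responsible for caching this file."""
        idx = bisect.bisect_right(self._points, _point(file_id)) % len(self._ring)
        return self._ring[idx][1]


class Cluster:
    """Agents in one availability zone sharing a read-through cache."""

    def __init__(self, agents, object_store: dict):
        self.ring = HashRing(agents)
        self.caches = {agent: {} for agent in agents}
        self.object_store = object_store
        self.object_store_gets = 0

    def read(self, receiving_agent: str, file_id: str) -> str:
        owner = self.ring.owner(file_id)
        cache = self.caches[owner]
        if file_id not in cache:
            # Only the owner ever goes to object storage for this file.
            self.object_store_gets += 1
            cache[file_id] = self.object_store[file_id]
        # When receiving_agent != owner, this return stands in for an
        # agent-to-agent network call inside the same zone.
        return cache[file_id]


store = {f"schema-{i}": f"<schema {i}>" for i in range(100)}
cluster = Cluster(["agent-1", "agent-2", "agent-3"], store)
for agent in cluster.caches:  # every agent is asked for every schema
    for file_id in store:
        cluster.read(agent, file_id)
```

After the loop, `cluster.object_store_gets` is 100 rather than 300: each schema file was fetched from the store once, no matter which agent received the request.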

As shown in this diagram, when agent 3 receives fetch requests for schemas with IDs 1 and 2, it fetches the schemas from agent 1 and agent 2, respectively, and not from object storage.

An added benefit of using the distributed file cache is that the read latency of a newly booted agent won’t be significantly worse than the latency of other agents, because it won’t need to hydrate its local cache from object storage. This is important because we don’t want latency to increase significantly when scaling up new agents during read spikes.

Minimizing Inter-zone Networking Calls

While easy to miss, inter-zone networking fees are a real burden on many companies’ bottom lines. At WarpStream we keep this constraint top of mind so that you don’t have to. WarpStream’s BYOC Schema Registry is designed to eliminate inter-zone networking fees. To achieve that, we needed a mechanism for you to configure your schema registry client to connect to a WarpStream Agent in the same availability zone. Luckily, we already ran into the same challenge when building WarpStream (check out this blog for more details).

The solution that works well for WarpStream’s BYOC Schema Registry is zone-aware routing using zone-specific URLs. The idea behind zone-specific URLs is to provide your schema registry clients with a zone-specific schema registry URL that resolves to an Agent’s IP address in the same availability zone.

When you create a WarpStream Schema Registry, you automatically get a unique schema registry URL. To create the zone-specific URL, simply embed the client’s availability zone into the schema registry URL. For example, the schema registry URL for a client running in us-east-1a might look like this:

api-11155fd1-30a3-41a5-9e2d-33ye5a71bfd9.us-east-1a.discovery.prod-z.us-east-1.warpstream.com:9094
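Embedding the zone can be done with a one-line helper. The sketch below assumes that the non-zonal registry URL is the same hostname without the availability zone label, and that the zone goes in right after the first label, which is what the example above suggests. Check the URL shown in your own console before relying on this.

```python
def zone_specific_url(registry_url: str, availability_zone: str) -> str:
    """Insert the availability zone as the second DNS label of the registry host."""
    first_label, rest = registry_url.split(".", 1)
    return f"{first_label}.{availability_zone}.{rest}"


# Assumed non-zonal form of the example URL above.
base_url = "api-11155fd1-30a3-41a5-9e2d-33ye5a71bfd9.discovery.prod-z.us-east-1.warpstream.com:9094"
zonal_url = zone_specific_url(base_url, "us-east-1a")
```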

When the schema registry client makes a request to that URL, it will automatically connect to an Agent in the same availability zone. Zone-aware routing is made possible with two building blocks: WarpStream’s service discovery system and a custom zone-aware DNS server.

Simplified diagram of zone-aware routing. Each Heartbeat contains the Agent’s IP address and availability zone.

The way service discovery works is that each Agent will send periodic “heartbeat” requests to WarpStream’s service discovery system. Each request contains the Agent’s IP address and its availability zone. Thus, the service discovery system knows all the available Agents and their availability zones.
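A heartbeat-based registry of this kind can be sketched in a few lines. The class below is an illustration only: the names are ours, and the expiry window after which a silent Agent is considered gone is an assumption, since the post does not say how stale Agents are detected.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    ip: str
    availability_zone: str


class ServiceDiscovery:
    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # assumed expiry window, not a documented value
        self._clock = clock
        self._last_seen: dict[Agent, float] = {}

    def heartbeat(self, ip: str, availability_zone: str) -> None:
        """Record that an Agent reported its IP address and availability zone."""
        self._last_seen[Agent(ip, availability_zone)] = self._clock()

    def available_agents(self) -> list[Agent]:
        """Return Agents whose last heartbeat is within the expiry window."""
        now = self._clock()
        return [a for a, seen in self._last_seen.items() if now - seen <= self._ttl]


discovery = ServiceDiscovery()
discovery.heartbeat("10.0.1.5", "us-east-1a")
discovery.heartbeat("10.0.2.7", "us-east-1b")
```

Injecting the clock keeps the class easy to test: a fake clock can be advanced past the expiry window without sleeping.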

When the schema registry client initiates a request to the zone-specific schema registry URL, the DNS resolver will send a DNS query to WarpStream’s custom zone-aware DNS server. The DNS server will first parse the domain to extract the embedded availability zone. The DNS server will then query the service discovery system for a list of all available Agents, and return only the IP addresses of the Agents in the specified availability zone. Finally, the client will connect to an Agent in the same AZ. Note that if no Agents are in the same AZ as the client, the DNS server will return the IP addresses of all available Agents.
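The resolution steps above, including the fallback, can be sketched as follows. This is not the real DNS server: the function names are hypothetical, and the zone is assumed to be the second label of the hostname, as in the example URL earlier in this post.

```python
def extract_zone(hostname: str) -> str:
    """Return the availability zone embedded as the second label of the hostname."""
    return hostname.split(".")[1]


def resolve(hostname: str, agents: list[tuple[str, str]]) -> list[str]:
    """Pick IPs to return; agents is a list of (ip, availability_zone) pairs."""
    zone = extract_zone(hostname)
    same_zone = [ip for ip, az in agents if az == zone]
    # Fallback described above: with no Agent in the client's zone, return all of them.
    return same_zone or [ip for ip, _ in agents]


AGENTS = [
    ("10.0.1.5", "us-east-1a"),
    ("10.0.2.7", "us-east-1b"),
    ("10.0.1.9", "us-east-1a"),
]
HOST = "api-11155fd1-30a3-41a5-9e2d-33ye5a71bfd9.us-east-1a.discovery.prod-z.us-east-1.warpstream.com"
same_zone_ips = resolve(HOST, AGENTS)
```

Here `same_zone_ips` contains only the two Agents in us-east-1a. A client in a zone with no Agents would get all three addresses back and pay inter-zone fees, but would still be served.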

While not required for production usage, zone-aware routing can help reduce costs for high-volume schema registry workloads.

Schema Validation Made Easy

When configured to perform server-side schema validation, your Kafka agent needs to fetch schemas from a schema registry to check whether incoming data conforms to its expected schema. Normally, the Kafka agent fetches schemas from an external schema registry via HTTP. This introduces a point of failure: the Kafka agent won’t be able to handle produce requests if the schema registry is down. This is not a problem if the agent performs schema validation with WarpStream’s BYOC Schema Registry.

An advantage of the shared storage architecture of the BYOC Schema Registry is that no compute instance “owns” the schemas. All schemas live in object storage. As a result, the Kafka agent can fetch schemas directly from object storage instead of the schema registry agents. In other words, schema validation still works even when no schema registry agents are running, which means one less service dependency to worry about.
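The shape of that dependency can be shown with a toy example in which validation reads the schema straight from a bucket and never calls a registry service. Everything here is a stand-in: the in-memory bucket, the object key layout, and the field-presence check are assumptions for illustration, and real Avro, Protobuf, or JSON Schema validation is much stricter.

```python
import json


class InMemoryBucket:
    """Stand-in for an object storage bucket so the sketch runs locally."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def schema_key(schema_id: int) -> str:
    # Hypothetical key layout; the post does not describe the real one.
    return f"schemas/{schema_id}.json"


def record_matches_schema(bucket: InMemoryBucket, schema_id: int, record: dict) -> bool:
    """Toy check: the record contains every field named in the schema."""
    schema = json.loads(bucket.get(schema_key(schema_id)))
    return all(field["name"] in record for field in schema["fields"])


bucket = InMemoryBucket()
bucket.put(schema_key(1), json.dumps({"fields": [{"name": "id"}, {"name": "email"}]}).encode())
ok = record_matches_schema(bucket, 1, {"id": 7, "email": "a@example.com"})
missing = record_matches_schema(bucket, 1, {"id": 7})
```

The only thing the validator depends on is the bucket, so in this sketch there is no registry process whose outage could block produce requests.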

Next Steps

WarpStream’s BYOC Schema Registry is the newest addition to WarpStream’s BYOC product. Similar to how WarpStream is a cloud-native redesign of the Kafka protocol, WarpStream’s BYOC Schema Registry is a reimplementation of the Kafka Schema Registry API, bringing all the benefits of WarpStream’s BYOC deployment model to your schema registries.

When building WarpStream’s BYOC Schema Registry, we made a deliberate effort to minimize your operational costs and infrastructure bills, with techniques like zone-aware routing and the distributed file cache.

If you want to get started with WarpStream’s BYOC Schema Registry, you can have a Schema Registry agent running locally on your laptop in under 30 seconds with the playground / demo command. Alternatively, you can navigate to the WarpStream Console, configure a WarpStream Schema Registry virtual cluster, and then deploy the schema registry agents in your VPC. To learn more about how to use WarpStream’s BYOC Schema Registry, check out the docs.
