Announcing Schema Validation with AWS Glue Schema Registry

Sep 25, 2024
Brian Shih
HN Disclaimer: WarpStream sells a drop-in replacement for Apache Kafka built directly on top of object storage.

WarpStream’s Bring Your Own Cloud (BYOC) deployment model isn’t just about making sure that your data never leaves your environment. It’s also about making sure that the integration between WarpStream and the rest of your tech stack is as seamless as possible. This includes allowing you to integrate WarpStream with whichever schema registry you already use.

A few months ago, we announced WarpStream schema validation, which allows users to connect WarpStream Agents to any Kafka-compatible Schema Registry and validate that the records conform to the expected schema. 

When we originally built the schema validation feature, we designed it to work with popular Kafka schema registries such as the Confluent Schema Registry. However, many of our customers run WarpStream in AWS, and in our conversations with them, we learned that some of them use AWS Glue as their schema registry for Kafka rather than a Kafka-specific schema registry.

We are happy to share that we now support AWS Glue Schema Registries, in addition to the Kafka-compatible schema registries. With this integration, the WarpStream Agent can use schemas stored in the user’s AWS Glue Schema Registries to validate records.

How Schema Validation Works with AWS Glue Schema Registry

Even though Kafka-compatible schema registries and AWS Glue Schema Registry use different serialization formats, the formats are quite similar. Most importantly, both embed a unique schema identifier that points to a schema in the schema registry.

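Here is the serialization format for Confluent Schema Registry: a 1-byte magic byte, followed by a 4-byte schema ID, followed by the serialized data.

Here is the serialization format for AWS Glue Schema Registry: a 1-byte header version, followed by a 1-byte compression indicator, followed by a 16-byte schema version UUID, followed by the serialized (and optionally compressed) data.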

When the WarpStream Agent receives a serialized record from the producer, it first extracts the schema identifier. In the case of Confluent’s SerDes format, this is the 4-byte integer representing a schema ID. In the case of AWS Glue’s SerDes format, this is the 16-byte UUID representing a schema version ID. It will then fetch the schema from the schema registry with the identifier and cache it in memory. Finally, it uses the schema to validate incoming records.
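The extract, fetch, and cache steps described above can be sketched roughly as follows. This is an illustrative Python sketch, not WarpStream's Agent code: the function and class names are hypothetical, the leading byte values (0 for Confluent, 3 for AWS Glue) are assumptions based on the publicly documented conventions of the respective serializer libraries, and the `fetch` callback stands in for whatever registry client is actually used.

```python
import struct
import uuid
from typing import Callable, Dict, Tuple, Union

# Assumed leading bytes; these come from the serializer libraries'
# conventions, not from this article.
CONFLUENT_MAGIC_BYTE = 0
GLUE_HEADER_VERSION = 3

SchemaKey = Union[int, uuid.UUID]


def extract_schema_identifier(record: bytes) -> Tuple[str, SchemaKey, bytes]:
    """Return (registry kind, schema identifier, bytes after the header)."""
    if not record:
        raise ValueError("empty record")
    if record[0] == CONFLUENT_MAGIC_BYTE:
        if len(record) < 5:
            raise ValueError("record too short for Confluent framing")
        # 4-byte big-endian integer schema ID follows the magic byte.
        (schema_id,) = struct.unpack(">I", record[1:5])
        return "confluent", schema_id, record[5:]
    if record[0] == GLUE_HEADER_VERSION:
        if len(record) < 18:
            raise ValueError("record too short for AWS Glue framing")
        # Byte 1 is the compression indicator; bytes 2..17 are the
        # 16-byte schema version UUID.
        version_id = uuid.UUID(bytes=record[2:18])
        return "glue", version_id, record[18:]
    raise ValueError("unrecognized serialization header")


class SchemaCache:
    """In-memory cache so each schema is fetched from the registry once."""

    def __init__(self, fetch: Callable[[str, SchemaKey], object]) -> None:
        self._fetch = fetch  # hypothetical registry lookup callback
        self._schemas: Dict[Tuple[str, SchemaKey], object] = {}

    def get(self, kind: str, key: SchemaKey) -> object:
        cache_key = (kind, key)
        if cache_key not in self._schemas:
            self._schemas[cache_key] = self._fetch(kind, key)
        return self._schemas[cache_key]
```

Keying the cache on both the registry kind and the identifier keeps a Confluent integer ID and a Glue UUID from ever colliding, which matters if a single deployment validates topics against more than one registry.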

Note that an AWS Glue serialized record may contain zlib-compressed data (this is identified by checking whether the second byte of the encoded data is 5). In that case, the WarpStream Agent needs to decompress the data in order to validate it. This increases CPU and memory utilization, which is the tradeoff for enabling the feature. But since Agents are stateless, it’s a lot easier to scale up the number of Agents than it is to scale stateful brokers in Kafka.
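As a rough illustration of the compression check, the sketch below inspects the second byte of a Glue-framed record and decompresses the payload with Python's standard `zlib` module when that byte is 5. The function name is hypothetical, and the header version value (3), the "no compression" value (0), and the 18-byte header length are assumptions drawn from the AWS serializer library's conventions rather than from this post.

```python
import zlib

GLUE_HEADER_VERSION = 3    # assumed first byte of a Glue-framed record
GLUE_COMPRESSION_NONE = 0  # assumed value meaning "payload is not compressed"
GLUE_COMPRESSION_ZLIB = 5  # second byte value that signals zlib compression
GLUE_HEADER_SIZE = 18      # 1 version byte + 1 compression byte + 16-byte UUID


def glue_payload(record: bytes) -> bytes:
    """Return the payload of a Glue-framed record, decompressed if needed."""
    if len(record) < GLUE_HEADER_SIZE or record[0] != GLUE_HEADER_VERSION:
        raise ValueError("not an AWS Glue framed record")
    compression = record[1]
    body = record[GLUE_HEADER_SIZE:]
    if compression == GLUE_COMPRESSION_ZLIB:
        # This is the extra CPU and memory cost mentioned above.
        return zlib.decompress(body)
    if compression == GLUE_COMPRESSION_NONE:
        return body
    raise ValueError(f"unknown compression byte: {compression}")
```

The decompressed bytes are what would then be checked against the schema; uncompressed records skip the extra work entirely.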

Using WarpStream in Your ETL Workflows

One of AWS Glue’s primary use cases is running ETL jobs that ingest data from streaming sources and feed it into your data lakes and data warehouses. This use case aligns with WarpStream’s strengths, particularly its ability to handle data pipelines with high throughput and relaxed latency requirements.

With WarpStream’s first-class AWS Glue integration, you can seamlessly replace any of your existing data streams in your AWS Glue ETL workflows with WarpStream. The best part is that the WarpStream Agents will continue to use your existing AWS Glue Schema Registries to enforce the data quality of the streams. This saves you the hassle of migrating your schemas out of your AWS Glue schema registry and running a separate schema registry for your Kafka clusters.

If you want to learn more about integrating WarpStream Agents with your AWS Glue Schema Registry, check out our documentation. Also, stay tuned for WarpStream’s own BYOC Schema Registry that is launching soon!

Author
Brian Shih
Software Engineer