Schemas in Apache Kafka® let operators ensure that their data conforms to an expected structure, and prevent data quality and compliance issues such as rogue producers writing records to Kafka topics where they don’t belong. Rogue data is especially problematic when downstream applications expect to receive only records that conform to a specific schema and format, e.g., ETL applications that write to a database.
Schemas have become ubiquitous in Kafka, and are a key component of any data governance, compliance, and platform management regime.
Historically, schemas in Kafka have been implemented as a client-side feature in order to reduce load on the stateful Kafka brokers. Kafka uses an external server (a Schema Registry) to store schemas, and the producer client fetches schemas from the registry and caches them along with their IDs. Before writing to Kafka, the producer client serializes the record against the schema retrieved from the Schema Registry. If the record does not conform to the schema, the serializer throws an error and the producer does not produce the record, which protects against incorrect data being written to Kafka by that producer. If the record does conform, the producer prepends the schema ID to the serialized record and writes it to Kafka. On the other side, the consumer looks up the schema referenced by that ID in the Schema Registry; if the record’s schema is compatible with the schema the consumer expects, the consumer deserializes the record, and if not, it throws an error.
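For readers who have not set this up before, here is a minimal sketch of the usual client-side arrangement using the Confluent Java serializers; the broker address, registry URL, topic name, and schema are illustrative placeholders, not anything specific to WarpStream.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SchemaAwareProducer {
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"long\"}," +
        "{\"name\":\"email\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker/Agent address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer fetches/caches the schema ID from the registry and
        // refuses to serialize records that do not match the schema.
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 42L);
        user.put("email", "jane@example.com");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // If `user` did not conform to the registered schema, the serializer
            // would throw here and nothing would be written to Kafka.
            producer.send(new ProducerRecord<>("users", "42", user));
        }
    }
}
```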
While this implementation satisfies the basic requirement to add schemas to records in Kafka, it lacks broker-side validation, which means that schemas are entirely a client-side feature. That’s problematic because it relies on clients to always do the right thing. The Kafka broker will happily accept whatever it’s given by any Kafka client, so while the client can validate that its own writes and reads conform to the proper schema, there is nothing that prevents another client from writing data that does not conform to the schema defined by a well-behaved producer. Broker-side validation is necessary to implement centralized data governance policies.
Various data governance products have been launched that enable the Kafka broker to do some schema validation; however, these features are limited in their utility because they can only validate that the schema ID encoded in a record matches a schema ID in the Schema Registry, not that the record’s contents actually match the expected schema.
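To make the distinction concrete: the standard Schema Registry wire format prepends a magic byte and a 4-byte schema ID to each serialized payload, so an ID-only check looks roughly like the sketch below. It confirms the ID is known but never decodes the payload, which is why a malformed record carrying a valid ID still slips through.

```java
import java.nio.ByteBuffer;
import java.util.Set;

public final class IdOnlyCheck {
    private static final byte MAGIC_BYTE = 0x0;

    /**
     * Returns true if the record value carries a known schema ID.
     * Note: this never decodes the payload, so a record whose bytes do not
     * actually conform to that schema will still pass.
     */
    public static boolean hasKnownSchemaId(byte[] value, Set<Integer> knownSchemaIds) {
        if (value == null || value.length < 5 || value[0] != MAGIC_BYTE) {
            return false;
        }
        int schemaId = ByteBuffer.wrap(value, 1, 4).getInt(); // big-endian 4-byte ID
        return knownSchemaIds.contains(schemaId);
    }
}
```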
WarpStream is excited to announce that users can now connect WarpStream Agents to any Kafka-compatible Schema Registry and validate that records conform to the provided schema!
WarpStream Schema Validation validates not only that the schema ID encoded in a given record matches a schema ID in the Schema Registry, but also that the record actually conforms to the provided schema. In addition, WarpStream Schema Validation adds a “warning-only” configuration property which, when enabled, emits a metric identifying records that would have been rejected instead of rejecting them, making it easier to test and monitor schema migrations without implementing separate dead-letter queues or risking data loss. WarpStream Schema Validation is built into the WarpStream Agent, so the validation happens in the customer’s environment and no data is exfiltrated.
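As a conceptual sketch only (not WarpStream’s actual implementation), full conformance validation and the warning-only behavior could look something like the following for an Avro-encoded value. The 5-byte offset skips the standard magic-byte/schema-ID header, and the metric hook is a hypothetical placeholder.

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.DecoderFactory;

public final class ConformanceCheck {
    /**
     * Full validation: attempt to decode the Avro payload (the bytes after the
     * 5-byte header) against the schema fetched for the record's schema ID.
     */
    public static boolean conformsTo(byte[] value, Schema schema) {
        try {
            GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
            reader.read(null, DecoderFactory.get()
                .binaryDecoder(value, 5, value.length - 5, null));
            return true;
        } catch (IOException | RuntimeException e) {
            return false; // payload does not decode cleanly against the schema
        }
    }

    /** Warning-only mode counts the violation instead of rejecting the record. */
    public static boolean accept(byte[] value, Schema schema, boolean warningOnly,
                                 Runnable incrementRejectedMetric) { // hypothetical metric hook
        if (conformsTo(value, schema)) {
            return true;
        }
        if (warningOnly) {
            incrementRejectedMetric.run(); // surface the violation as a metric
            return true;                   // but still accept the record
        }
        return false;                      // reject: producer receives an error
    }
}
```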
To connect the WarpStream Agents with a Schema Registry, specify the optional <span class="codeinline">-schemaRegistryURL</span> flag in the Agent configuration. WarpStream supports Basic, TLS, and mTLS authentication between the Agent and the Schema Registry.
Once the Agents are connected to a compatible Schema Registry, WarpStream Schema Validation can be enabled on a per-topic basis with topic-level configurations, as illustrated below; see the docs for the specific configuration keys.
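Because WarpStream speaks the Kafka protocol, topic-level configuration is typically managed with standard Kafka tooling. The sketch below uses the Java AdminClient to set a topic config; note that the key name <span class="codeinline">warpstream.schema.validation.enabled</span> is a hypothetical placeholder used purely for illustration, so consult the WarpStream documentation for the real configuration names.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTopicValidation {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder Agent address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "users");
            // "warpstream.schema.validation.enabled" is a HYPOTHETICAL key used for
            // illustration only; the real key names live in the WarpStream docs.
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("warpstream.schema.validation.enabled", "true"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```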
Enabling record-level validation with an external Schema Registry does increase CPU load on the Agents. But unlike Kafka brokers, which cannot be auto-scaled without significant operational toil and risk of data loss, WarpStream Agents are completely stateless and can be scaled elastically based on basic signals like CPU utilization. This means a WarpStream cluster can be scaled automatically on the fly, so there’s no need to permanently provision extra Agents in anticipation of increased CPU utilization.
In addition, using Agent Roles, WarpStream can isolate parts of a workload to a specific set of Agents, which reduces the impact of increased load caused by Schema Validation. Schema Validation uses the <span class="codeinline">proxy-produce</span> role, so Agents handling <span class="codeinline">Produce()</span> requests can be isolated from the rest of the cluster and scaled independently.
WarpStream currently supports validating records against JSON and Avro schemas, with Protobuf support coming soon. And while today’s implementation of WarpStream Schema Validation relies on external Schema Registries, we are also working on building our own WarpStream-native schema registry.
To learn more about WarpStream Schema Validation, read the docs, or contact us.