Tiered storage is a hot topic in the world of data streaming systems, and for good reason. Cloud disks are (really) expensive, object storage is cheap, and in most cases, live consumers are just reading the most recently written data. Paying for expensive cloud disks to store historical data isn’t cost-effective, so historical data should be moved (tiered) to object storage. On paper, it makes all the sense in the world.
Chances are you probably had a strong reaction to the title of this post. In our experience, Kafka is one of the most polarizing technologies in the data space. Some people hate it, some people swear by it, but almost every technology company uses it.
In this post, I’ll start off with a brief overview of “shared nothing” vs. “shared storage” architectures in general. This discussion will be a bit abstract and high-level, but the goal is to share with you some of the guiding philosophy that ultimately led to WarpStream’s architecture.
Orbit is a tool that creates identical, inexpensive, scalable, and secure continuous replicas of Kafka clusters. It is built into WarpStream and works without any user intervention to create WarpStream replicas of any Apache Kafka-compatible source cluster.
WarpStream now supports AWS Glue Schema Registries, in addition to the Kafka-compatible schema registries. The WarpStream Agent can use schemas stored in the user’s AWS Glue Schema Registries to validate records.
Backpressure is a really simple concept. When the system is nearing overload, it should start “saying no” by slowing down or rejecting requests. Of course, the big question is: How do we know when we should reject a request?
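As a rough illustration of “saying no”, here is a minimal sketch that uses the number of in-flight requests as the overload signal. That signal, the `InFlightLimiter` class, and the `handle` helper are all assumptions made for this example; the teaser above does not say which signal a real system should use.

```python
import threading


class InFlightLimiter:
    """Hypothetical gate that rejects requests once too many are in flight."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Admit the request only if we are below the configured limit.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        # Called when a request finishes, whether it succeeded or failed.
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


def handle(limiter: InFlightLimiter, work):
    """Run `work` if the limiter admits it, otherwise reject immediately."""
    if not limiter.try_acquire():
        return "rejected"
    try:
        return work()
    finally:
        limiter.release()
```

Rejecting early keeps the cost of an overloaded request close to zero, which is the point of backpressure: the system sheds work before it runs out of resources rather than after.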
Traditional offset-based monitoring can be misleading due to varying message sizes and consumption rates. To address this, you can introduce a time-based metric for a more accurate assessment of consumer group lag.
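To make the difference concrete, here is a minimal sketch that computes both metrics for a single partition. It assumes the Kafka convention that the committed offset is the next offset to be consumed, and it models the partition as a plain list of record timestamps indexed by offset; the function names are hypothetical.

```python
def offset_lag(log_end_offset: int, committed_offset: int) -> int:
    """Number of records the consumer group has not yet consumed."""
    return max(0, log_end_offset - committed_offset)


def time_lag_ms(record_timestamps_ms: list[int], committed_offset: int, now_ms: int) -> int:
    """Age of the oldest unconsumed record, in milliseconds.

    record_timestamps_ms[i] is the timestamp of the record at offset i.
    """
    if committed_offset >= len(record_timestamps_ms):
        return 0  # The group is fully caught up.
    return max(0, now_ms - record_timestamps_ms[committed_offset])
```

With this definition, a group that is two records behind reports the same offset lag whether those records are two milliseconds or two hours old, while the time lag distinguishes the two cases.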
How we built support for running WarpStream's control plane and Metadata Store in multiple regions, while still presenting our platform as a single pane of glass.
WarpStream's Zero Disk Architecture enables a BYOC deployment model that is secure by default and does not require any external access to the customer's environment.
A follow-up to "Tiered Storage Won't Fix Kafka", this post covers the advantages that WarpStream's Zero Disk Architecture provides over Apache Kafka.
Managed Data Pipelines provide a fully managed SaaS user experience for Bento, without sacrificing any of the cost benefits, data sovereignty, or deployment flexibility of the BYOC deployment model.
This post is guest-authored by Fahad Shah from RisingWave and cross-posted from RisingWave's blog. It presents the development of a real-time security threat monitoring system that integrates RisingWave, WarpStream, and Grafana. The setup process for the entire system is quite straightforward: to monitor each metric, you only need to create a single materialized view in RisingWave and visualize it in Grafana.
We’re excited to announce that WarpStream now natively embeds Bento, a stateless stream processing framework that connects to many data sources and sinks. Bento offers much of the functionality of Kafka Connect, as well as additional lightweight stream processing functions.
Many of today's most highly adopted open source “big data” infrastructure projects (like Cassandra, Kafka, and Hadoop) follow a common story. A large company, startup or otherwise, faces a unique, high-scale infrastructure challenge that's poorly supported by existing tools. They create an internal solution for their specific needs, and then later (kindly) open source it for the greater community to use. Now, even smaller startups can benefit from the work and expertise of these seasoned engineering teams. Great, right?
How we leverage Antithesis to deterministically simulate our entire SaaS platform and verify its correctness, all the way from signup to running entire Kafka workloads.
Benchmarking databases, and maintaining fairness and integrity while doing so, is notoriously difficult to get right, especially in the data streaming space. Vendors want their systems to produce mouth-watering results, so they test unnatural configurations divorced from customer realities (AKA “vanity” benchmarks), and it's ultimately the end user who is left holding the bag when they realize that their actual TCO is a lot higher than they were led to believe.
A huge part of building a drop-in replacement for Apache Kafka® was implementing support for compacted topics. The primary difference between a “regular” topic in Kafka and a “compacted” topic is that Kafka will asynchronously delete records from compacted topics that are not the latest record for a specific key within a given partition.
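The retention rule described above can be sketched in a few lines. This is only an illustration of the end result for one partition, not how Kafka or WarpStream implement compaction; the `Record` type and `compact` function are hypothetical names introduced for the example.

```python
from typing import NamedTuple


class Record(NamedTuple):
    offset: int
    key: str
    value: str


def compact(records: list[Record]) -> list[Record]:
    """Keep only the latest record for each key within one partition.

    `records` is assumed to be ordered by offset. Surviving records keep
    their original offsets and relative order.
    """
    latest_offset_by_key: dict[str, int] = {}
    for record in records:
        latest_offset_by_key[record.key] = record.offset
    return [r for r in records if latest_offset_by_key[r.key] == r.offset]
```

For example, compacting offsets 0 (`a`), 1 (`b`), and 2 (`a`) leaves offsets 1 and 2: the older record for key `a` is dropped, and the gap in offsets remains visible to consumers.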
Serverless products and usage-based billing models go hand in hand, almost by definition. A product that is truly serverless effectively has to have usage-based pricing, otherwise it’s not really serverless!
Middleware is a cloud observability company that uses AI to help customers identify, understand, and fix issues across their infrastructure. It’s probably not a surprise to you that their entire business relies on scalable streaming infrastructure: they’re ingesting and processing tens of TiBs of telemetry every day, and need to process it very efficiently.
We first introduced WarpStream in our blog post "Kafka is Dead, Long Live Kafka", but to summarize: WarpStream is a Kafka-protocol-compatible data streaming system built directly on top of object storage.
If you’re on a Data or Data Platforms team, you’ve probably already seen the productivity boost that comes from pulling business logic out of various ETL pipelines, queries, and scripts and centralizing it in SQL in a clean, version-controlled git repo managed by dbt. The engine that unlocked this approach is the analytical data warehouse: typically Snowflake or BigQuery.