Structured Streaming

TLDR

Real‑time data ubiquity drives demand for scalable, user‑friendly streaming systems, and Structured Streaming distinguishes itself from other APIs like Google Dataflow by offering a purely declarative approach. The authors introduce Structured Streaming as a high‑level Spark API designed to enable end‑to‑end real‑time applications that combine streaming with batch and interactive analytics. Structured Streaming implements a declarative API that automatically incrementalizes static relational queries, and the authors detail its design and real‑world use cases from hundreds of Databricks deployments processing over 1 PB per month. The authors report that integrating streaming with batch and interactive analytics is a key practical challenge, yet Structured Streaming delivers high performance—up to twice that of Flink and 90× that of Kafka Streams—while providing operational features like rollbacks, code updates, and mixed streaming/batch execution.

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

References

Page 1

	Year	Citations

Page 1