Discretized streams - Concepedia

TLDR

Real‑time big‑data applications require scalable parallel platforms that automatically handle faults and stragglers, yet existing distributed stream processing models rely on costly hot replication or long recovery times and fail to address stragglers. The authors propose discretized streams (D‑Streams) as a new processing model that overcomes these challenges. D‑Streams implement a parallel recovery mechanism that improves efficiency over traditional replication, tolerates stragglers, can be composed with batch and interactive query models such as MapReduce, and are realized in the Spark Streaming system. The authors demonstrate that D‑Streams support a rich set of operators, achieve per‑node throughput comparable to single‑node systems, scale linearly to 100 nodes, and provide sub‑second latency and fault recovery.

Abstract

Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.

References

Page 1

	Year	Citations

Page 1