KOHA: Building a Kafka-Based Distributed Queue System on the Fly in a Hadoop Cluster

Abstract

Message queues play a crucial role in distributed, scalable systems by interconnecting loosely coupled, autonomous computational units. Among state-of-the-art distributed message queue systems, Apache Kafka achieves high throughput, low latency, and good load balancing. We recently developed a new data processing framework that efficiently handles a very large number of tasks on top of a Hadoop cluster by using Kafka as a job queue, which motivated us to explore further opportunities for using Kafka on the Hadoop platform. Apache Hadoop has already become the de facto big data processing infrastructure, and with the help of YARN it is now evolving into a multi-use data platform that can support various types of data processing workflows. Therefore, effectively using Kafka in a Hadoop cluster for purposes such as message distribution, task processing, and metadata management can potentially contribute to the expansion of the current Hadoop ecosystem. In this paper, we design and implement a framework called KOHA (Kafka On HAdoop) that gives users a simple, convenient, and powerful way to develop large-scale distributed Kafka-based applications running on top of a Hadoop cluster. The framework automatically builds and starts Kafka brokers on the fly and allocates resources to launch producers and consumers. Users can adopt Apache Kafka through the framework without having to understand the YARN programming model or deploy a Kafka cluster themselves. We also present a use case in which the framework is used to evaluate Kafka's performance under various test cases and working scenarios. The experimental results help potential Kafka users understand how different settings affect queuing performance.
