Join processing using Bloom filter in MapReduce

Abstract

MapReduce is a programming model that is used extensively for large-scale data analysis, and the join is one of the essential operations in such analysis. However, MapReduce performs joins inefficiently because it always processes every record in the input datasets, even when only a small fraction of the records is relevant to the join. We alleviate this problem by applying the bloomjoin algorithm, a classic distributed join algorithm, and improve join performance by using Bloom filters in MapReduce. In our approach, the Bloom filters are constructed in a distributed fashion and are used to filter out redundant intermediate records. To apply the Bloom filters in MapReduce, we modify Hadoop to assign the input datasets to map tasks sequentially, and we propose a method that determines the processing order of the input datasets based on the estimated cost. Our experimental results show that our architecture reduces the number of intermediate results and can improve join performance.
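The filtering idea can be illustrated with a minimal single-process sketch. It is not the Hadoop modification described above: it only assumes that each input split of the first dataset builds a local Bloom filter, that the local filters are merged with a bitwise OR, and that the merged filter is then used to drop records of the second dataset before the join. All names (`BloomFilter`, `build_filter`, `bloom_join`) and the filter parameters (1024 bits, 3 hash functions) are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def union(self, other):
        # Filters with identical parameters merge by OR-ing their bits.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def build_filter(splits):
    """Build one local filter per split (standing in for a map task), then merge."""
    merged = BloomFilter()
    for split in splits:
        local = BloomFilter()
        for key, _value in split:
            local.add(key)
        merged = merged.union(local)
    return merged


def bloom_join(left_splits, right_records):
    """Join on key, discarding right-side records the filter rules out."""
    bloom = build_filter(left_splits)
    survivors = [(k, v) for k, v in right_records if bloom.might_contain(k)]

    left_by_key = defaultdict(list)
    for split in left_splits:
        for k, v in split:
            left_by_key[k].append(v)

    # The exact key match removes any false positives the filter let through.
    joined = [(k, lv, rv) for k, rv in survivors for lv in left_by_key.get(k, [])]
    return joined, len(survivors)
```

In this sketch, `len(survivors)` plays the role of the intermediate record count: right-side records whose keys cannot appear in the first dataset are dropped before the join, while the final exact match keeps the result correct despite occasional false positives.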
