Publication | Closed Access
MapReduce-based similarity join for metric spaces
Citations: 24
References: 14
Year: 2012
Unknown Venue
Cluster Computing, Engineering, Similarity Measure, Map-reduce, Mapreduce-based Similarity Join, Cloud Systems, Information Retrieval, Data Science, Data Mining, Management, Data Integration, Cloud Data Management, Data Management, Knowledge Discovery, Computer Science, Distributed Query Processing, Present Mrsimjoin, Relational Queries, Similarity Join, Cloud Computing, Similarity Search, Massive Data Processing, Big Data
Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed in a single node. MRSimJoin is general enough to be used with data that lies in any metric space, and thus can be used with multiple data types and distance functions. We present guidelines to implement the algorithm in Hadoop, an open-source cloud system. The experimental evaluation of MRSimJoin shows that it has very good execution time and scalability properties.
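To make the Similarity Join definition and the partition-then-join idea concrete, here is a minimal Python sketch. It is not the MRSimJoin algorithm from the paper: it simulates a single map/reduce round in memory, whereas the abstract describes repartitioning until the subsets fit in one node. It assumes a common pivot-based scheme in which each record goes to the partition of its closest pivot and is also copied to any partition whose pivot is at most 2ε farther away. All names (`map_phase`, `reduce_phase`, `similarity_join`, `brute_force_join`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations
import math


def euclidean(a, b):
    """One possible metric; any function obeying the triangle inequality works."""
    return math.dist(a, b)


def map_phase(records, pivots, eps, dist):
    """Emit (partition_id, record_id) pairs.

    A record is sent to its closest pivot's partition and replicated to every
    partition whose pivot is at most 2*eps farther away. By the triangle
    inequality, any two records closer than eps then share a partition.
    """
    for rid, rec in enumerate(records):
        d = [dist(rec, p) for p in pivots]
        home = min(range(len(pivots)), key=d.__getitem__)
        for pid, dp in enumerate(d):
            if pid == home or dp - d[home] <= 2 * eps:
                yield pid, rid


def reduce_phase(groups, records, eps, dist):
    """Nested-loop join inside each partition; the set removes duplicate pairs."""
    result = set()
    for rids in groups.values():
        for i, j in combinations(sorted(rids), 2):
            if dist(records[i], records[j]) < eps:
                result.add((i, j))
    return result


def similarity_join(records, pivots, eps, dist=euclidean):
    """Simulate one map/shuffle/reduce round in memory (illustrative only)."""
    groups = defaultdict(list)  # the "shuffle": group record ids by partition
    for pid, rid in map_phase(records, pivots, eps, dist):
        groups[pid].append(rid)
    return reduce_phase(groups, records, eps, dist)


def brute_force_join(records, eps, dist=euclidean):
    """Reference answer: compare every pair directly."""
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if dist(records[i], records[j]) < eps
    }
```

Because the only requirement on `dist` is that it be a metric, the same sketch could be called with, for example, an edit-distance function over strings, which mirrors the abstract's point that the approach is not tied to one data type. Comparing the output of `similarity_join` against `brute_force_join` on a small input is a simple way to check that the replication rule loses no pairs.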