Parallel Algorithms for Distance-Based and Density-Based Outliers

Abstract

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers can indicate individuals or groups that exhibit a behavior that is very different from most of the individuals of the dataset. In this paper we design two parallel algorithms, the first one is for finding out distance-based outliers based on nested loops along with randomization and the use of a pruning rule. The second parallel algorithm is for detecting density-based local outliers. In both cases data parallelism is used. We show that both algorithms reach near linear speedup. Our algorithms are tested on four real-world datasets coming from the Machine Learning Database Repository at the UCI.

References

Page 1

	Year	Citations

Page 1