Sampling Large Databases for Association Rules

TLDR

Association rule discovery is a key database mining problem, yet existing algorithms require multiple database passes, making I/O overhead significant for very large databases. The authors introduce algorithms that reduce database activity for association rule mining. They employ random sampling to identify likely rules and then verify them on the full database, using a probabilistic approach that may need a second pass for missed rules. Experiments demonstrate that the algorithms yield exact association rules and can discover them efficiently in a single database pass.

Abstract

Discovery of association rules .is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a Random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database

References

Page 1

	Year	Citations

Page 1