An effective hash-based algorithm for mining association rules

TLDR

Mining association rules is equivalent to finding large itemsets—groups of items appearing in many transactions—typically performed iteratively by increasing itemset size, but early iterations are dominated by the need to sift through many candidate itemsets, which hampers overall performance. The authors aim to improve association rule mining in large sales databases by proposing an effective hash‑based algorithm for generating candidate itemsets. They construct a candidate set of itemsets and use a hash‑based method to efficiently generate and filter candidates, then evaluate the algorithm’s performance through extensive simulations. The hash‑based algorithm reduces the number of candidate 2‑itemsets by orders of magnitude, which trims the transaction database early and significantly lowers computational cost in later iterations.

Abstract

In this paper, we examine the issue of mining association rules among items in a large database of sales transactions. The mining of association rules can be mapped into the problem of discovering large itemsets where a large itemset is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing a candidate set of itemsets first and then, identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally this is done iteratively for each large k -itemset in increasing order of k where a large k -itemset is a large itemset with k items. To determine large itemsets from a huge number of candidate large itemsets in early iterations is usually the dominating factor for the overall data mining performance. To address this issue, we propose an effective hash-based algorithm for the candidate set generation. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is, in orders of magnitude, smaller than that by previous methods, thus resolving the performance bottleneck. Note that the generation of smaller candidate sets enables us to effectively trim the transaction database size at a much earlier stage of the iterations, thereby reducing the computational cost for later iterations significantly. Extensive simulation study is conducted to evaluate performance of the proposed algorithm.

References

Page 1

	Year	Citations

Page 1