Concepedia

Publication | Closed Access

Scalable sequential pattern mining for biological sequences

102

Citations

16

References

2004

Year

Abstract

Biosequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., "don't care") of arbitrary size. Mining frequent patterns in such sequences faces a different type of explosion than in transaction sequences primarily motivated in market-basket analysis. In this paper, we study how this explosion affects the classic sequential pattern mining, and present a scalable two-phase algorithm to deal with this new explosion. The <i>Segment Phase</i> first searches for short patterns containing no gaps, called <i>segments</i>. This phase is efficient. The <i>Pattern Phase</i> searches for long patterns containing multiple segments separated by variable length gaps. This phase is time consuming. The purpose of two phases is to exploit the information obtained from the first phase to speed up the pattern growth and matching and to prune the search space in the second phase. We evaluate this approach on synthetic and real life data sets.

References

YearCitations

Page 1