Publication | Closed Access
Scalable sequential pattern mining for biological sequences
Citations: 102
References: 16
Year: 2004
Unknown Venue
Engineering, Pattern Discovery, Pattern Mining, Mining Methods, Pattern Growth, Knowledge Discovery In Databases, Information Retrieval, Data Science, Data Mining, Search Space, Biological Sequences, Knowledge Discovery, Computer Science, Mining Frequent Patterns, Functional Genomics, Bioinformatics, Frequent Pattern Mining, Computational Biology, Combinatorial Pattern Matching, Structure Mining, Systems Biology, Medicine
Biosequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., "don't care" positions) of arbitrary size. Mining frequent patterns in such sequences faces a different type of explosion than in the transaction sequences that motivated classic sequential pattern mining, primarily in market-basket analysis. In this paper, we study how this explosion affects classic sequential pattern mining and present a scalable two-phase algorithm to deal with it. The <i>Segment Phase</i> first searches for short patterns containing no gaps, called <i>segments</i>. This phase is efficient. The <i>Pattern Phase</i> then searches for long patterns containing multiple segments separated by variable-length gaps. This phase is time-consuming. The purpose of the two phases is to exploit the information obtained from the first phase to speed up pattern growth and matching and to prune the search space in the second phase. We evaluate this approach on synthetic and real-life data sets.
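The two-phase idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm or its pruning strategy; it only shows, under simplifying assumptions, how segment occurrences from the first phase can be reused to grow gapped patterns in the second. The function names `find_segments` and `grow_patterns`, the parameters `min_len`, `max_len`, `max_segments`, and `min_gap`, and the definition of support as the number of sequences containing a pattern are all hypothetical choices made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def find_segments(sequences, min_len, max_len, min_sup):
    """Segment phase (sketch): gap-free substrings found in >= min_sup sequences.

    Returns {segment: {sequence_index: sorted list of start positions}}.
    """
    occ = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(sequences):
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                occ[seq[i:i + k]][sid].append(i)  # i increases, so lists stay sorted
    return {seg: dict(pos) for seg, pos in occ.items() if len(pos) >= min_sup}


def grow_patterns(segments, min_sup, max_segments, min_gap=1):
    """Pattern phase (sketch): chain segments separated by gaps of >= min_gap symbols.

    A pattern is a tuple of segments. For each pattern we keep, per sequence,
    the earliest end position of a match; taking the earliest end is enough
    because gaps have no upper bound in this sketch.
    Returns {pattern: support}.
    """
    frontier = {
        (seg,): {sid: starts[0] + len(seg) for sid, starts in pos.items()}
        for seg, pos in segments.items()
    }
    found = {}
    for size in range(1, max_segments + 1):
        for pat, ends in frontier.items():
            found[pat] = len(ends)
        if size == max_segments:
            break
        next_frontier = {}
        for pat, ends in frontier.items():
            for seg, pos in segments.items():
                new_ends = {}
                for sid, end in ends.items():
                    starts = pos.get(sid)
                    if starts is None:
                        continue
                    # Reuse phase-one positions instead of rescanning the sequence.
                    j = bisect_left(starts, end + min_gap)
                    if j < len(starts):
                        new_ends[sid] = starts[j] + len(seg)
                if len(new_ends) >= min_sup:  # extend only frequent patterns
                    next_frontier[pat + (seg,)] = new_ends
        frontier = next_frontier
    return found


sequences = ["ACGTTTACG", "ACGAAACG", "TTACGCACG"]
segs = find_segments(sequences, min_len=3, max_len=3, min_sup=3)
found = grow_patterns(segs, min_sup=3, max_segments=2)
```

On these three toy sequences, `ACG` is the only length-3 segment present in every sequence, and it occurs twice in each with a gap in between, so `found` contains the single-segment pattern `("ACG",)` and the two-segment pattern `("ACG", "ACG")`, each with support 3.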