Publication | Closed Access
Probabilistic models for focused web crawling
Citations: 36
References: 43
Year: 2004
Unknown Venue
Engineering, Semantic Web, Focused Web Crawling, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Intelligent Searching, Search Technology, Knowledge Discovery, Webometrics, Probability Theory, Computer Science, Focused Web, Query Analysis, Search Engine Design, Focused Crawler, Web Mining, Search Engine Indexing, Hidden Markov Models
A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modelling of context as well as of the current observations. Probabilistic models, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare it with the Best-First strategy. Furthermore, we discuss the concept of using CRFs to overcome the difficulties with HMMs and to support the use of many arbitrary and overlapping features. Finally, we describe the design of a system that applies CRFs to focused web crawling and is currently being implemented.
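For readers unfamiliar with the Best-First baseline mentioned in the abstract, the following is a minimal sketch of the general idea, not the paper's implementation: the crawler keeps a priority queue (the frontier) and always expands the URL with the highest estimated relevance. The names `best_first_crawl`, `TOY_GRAPH`, and `TOY_SCORES` are hypothetical, the "web" is an in-memory dictionary, and the per-URL scores are assumed to be given in place of whatever relevance estimator a real crawler would use.

```python
import heapq


def best_first_crawl(graph, scores, seeds, budget):
    """Visit up to `budget` pages, always expanding the best-scored frontier URL.

    graph  -- dict mapping a URL to the list of URLs it links to (assumed toy data)
    scores -- dict mapping a URL to its estimated relevance (assumed to be given)
    """
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []

    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        # Queue each newly discovered link with its estimated relevance.
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores.get(link, 0.0), link))

    return visited


# Hypothetical toy link graph and relevance estimates, for illustration only.
TOY_GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}

print(best_first_crawl(TOY_GRAPH, TOY_SCORES, ["seed"], budget=4))
```

In this sketch the score of a URL is fixed in advance and does not depend on the path that led to it. A sequence model such as an HMM or CRF, as proposed in the abstract, would instead estimate relevance from the sequence of pages already crawled.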