Publication | Closed Access
ViPER
Citations: 208
References: 20
Year: 2005
Unknown Venue
Engineering, Semantic Web, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Web Page, Knowledge Discovery, Repetitive Patterns, Computer Science, Information Extraction, Web Mining, DOM Tree, Search Engine Indexing, Data Extraction, Content Processing
In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised extraction becomes feasible for pages that are made up of repetitive patterns, as is the case, for example, for search engine result pages. The extraction rules are generated automatically, without any training or human interaction, by operating on the DOM tree or, respectively, the flat tag token sequence of a single page.

Our contribution to automatic data extraction is twofold. First, we identify and rank potential repetitive patterns with respect to the user's visual perception of the Web page, since the location and size of matching elements within a Web page are important criteria for defining relevance. Second, matching subsequences of the pattern with the highest weight are aligned using global multiple sequence alignment techniques. Experimental results show that our system achieves high accuracy in distilling and aligning regularly structured objects inside complex Web pages.
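To make the idea of repetitive patterns in a flat tag token sequence concrete, here is a minimal Python sketch. It is not the ViPER algorithm: the function names (`tag_tokens`, `repeated_patterns`, `rank`) are hypothetical, the n-gram search is a simplification, and the score (tokens covered by a pattern) is only an assumed stand-in for the visually informed weighting described in the abstract, which would need rendered position and size information. The alignment step is not covered.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Flatten an HTML page into a sequence of opening/closing tag tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def tag_tokens(html):
    """Return the flat tag token sequence of a single page (text is ignored)."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def repeated_patterns(tokens, min_len=2, max_len=8):
    """Map each tag n-gram to its non-overlapping start positions,
    keeping only n-grams that occur more than once."""
    found = {}
    for n in range(min_len, max_len + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            starts = positions[gram]
            # Skip occurrences that overlap the previously recorded one.
            if not starts or i >= starts[-1] + n:
                starts.append(i)
        for gram, starts in positions.items():
            if len(starts) > 1:
                found[gram] = starts
    return found


def rank(patterns):
    """Sort patterns by an assumed score: pattern length times occurrences.
    This replaces the paper's visual weighting, which is not reproduced here."""
    return sorted(
        patterns.items(),
        key=lambda item: len(item[0]) * len(item[1]),
        reverse=True,
    )


if __name__ == "__main__":
    page = "<ul><li><a>x</a></li><li><a>y</a></li><li><a>z</a></li></ul>"
    best_pattern, starts = rank(repeated_patterns(tag_tokens(page)))[0]
    print(" ".join(best_pattern), starts)
```

On a result-list-like page, the top-ranked pattern corresponds to one list record, and its start positions mark the candidate record boundaries that a later alignment step would operate on.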