Publication | Closed Access
PruSM
Citations: 20
References: 15
Year: 2010
Unknown Venue
Engineering, Semantic Web, Text Mining, Information Retrieval, Data Science, Data Mining, Management, Data Integration, Data Retrieval, Schema Matching, Data Management, Search Technology, Form Understanding, Knowledge Discovery, Computer Science, Search Engine Design, Query Optimization, Form Elements, Form Schemata, Search Engine Indexing
There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded by Web scale and by the noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.
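To make the idea of prudent matching concrete, the following is a minimal sketch and not the PruSM algorithm itself, whose details the abstract does not give. It assumes that form elements are represented only by their labels, that similarity is token-level Jaccard overlap, and that a fixed threshold decides which matches are confident enough to accept. The names `tokens`, `jaccard`, `prudent_match`, and the `threshold` value are all hypothetical.

```python
from itertools import product


def tokens(label):
    """Lowercase a form label and split it into a set of word tokens."""
    return set(label.lower().replace("_", " ").split())


def jaccard(a, b):
    """Token-level Jaccard similarity between two labels (an assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def prudent_match(form_a, form_b, threshold=0.6):
    """Accept only confident one-to-one matches, most similar pairs first.

    Pairs below `threshold` are left unmatched rather than guessed, so that
    an early wrong match cannot propagate into later decisions.
    """
    # Rank every candidate pair by similarity, highest first.
    candidates = sorted(
        ((jaccard(a, b), a, b) for a, b in product(form_a, form_b)),
        reverse=True,
    )
    matches, used_a, used_b = [], set(), set()
    for score, a, b in candidates:
        if score < threshold:
            break  # every remaining pair is too uncertain to accept
        if a in used_a or b in used_b:
            continue  # each element takes part in at most one match
        matches.append((a, b, score))
        used_a.add(a)
        used_b.add(b)
    unmatched = [a for a in form_a if a not in used_a]
    return matches, unmatched
```

For example, calling `prudent_match(["Departure City", "Return Date", "Make"], ["departure city", "date of return", "Brand"])` pairs the two labels that share tokens and leaves "Make" unmatched, because it has no lexical overlap with "Brand". A real system would need further evidence, such as element values or co-occurrence statistics across many forms, to resolve such cases.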