Publication | Open Access
DeExcelerator
Citations: 32
References: 3
Year: 2013
Unknown Venue
Engineering, Information Retrieval, Data Science, Unstructured Data, Structured Data, Knowledge Discovery, HTML Tables, Relational Form, Management, Data Integration, Semi-structured Data, Semantic Web, Information Extraction, Data Management, Structured Document, Text Mining, Data Modeling, Typical Normalization Problems
Only a small part of the structured data published on the web, whether as datasets on Open Data platforms such as data.gov or as HTML tables on the general web, is in relational form. Instead, the data is intermingled with formatting, layout, and textual metadata; that is, it is contained in partially structured documents. Transforming it into a true relational form is therefore necessary, since this is a precondition for most forms of data analysis and data integration. Using data.gov as an example source of partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
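To make the notion of a partially structured document concrete, the following is a minimal Python sketch of one such normalization problem: a sheet whose relation is surrounded by a title row, blank rows, and a footnote. It is not the DeExcelerator's algorithm, which this abstract does not describe. The function name `extract_relation`, the "first fully filled row is the header" heuristic, and the sample values are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def is_blank(cell: Optional[str]) -> bool:
    """A cell counts as blank if it is None or only whitespace."""
    return cell is None or str(cell).strip() == ""


def extract_relation(rows: List[Row]) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical heuristic, not the DeExcelerator's method.

    Treats the first fully filled row as the header and collects the
    fully filled rows that follow it, stopping at the first row with a
    blank cell (for example a separator line or a footnote).
    """
    width = max(len(r) for r in rows)
    # Pad ragged rows so every row has the same number of cells.
    grid = [list(r) + [None] * (width - len(r)) for r in rows]

    def filled(row: Row) -> int:
        return sum(not is_blank(c) for c in row)

    header_idx = next(
        (i for i, r in enumerate(grid) if filled(r) == width), None
    )
    if header_idx is None:
        raise ValueError("no fully filled row found to use as header")

    header = [str(c).strip() for c in grid[header_idx]]
    data: List[List[str]] = []
    for row in grid[header_idx + 1:]:
        if filled(row) < width:
            break  # end of the contiguous data block
        data.append([str(c).strip() for c in row])
    return header, data


# Invented sample sheet: title, blank row, header, data, blank row, footnote.
rows = [
    ["Illustrative figures by region", None, None],
    [None, None, None],
    ["Region", "Count", "Area"],
    ["Region A", "100", "10"],
    ["Region B", "250", "20"],
    [None, None, None],
    ["Note: values are invented for this example", None, None],
]

header, data = extract_relation(rows)
print(header)
print(data)
```

A heuristic this simple fails on sheets with multi-row headers, missing values inside the data block, or several tables per sheet, which is the kind of variety a classification of normalization problems has to account for.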