Publication | Closed Access
Extracting schema from semistructured data
Citations: 39
References: 8
Year: 1998
Engineering, Knowledge Extraction, Structured Data, Semantic Web, Semantics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Management, Data Integration, Semi-structured Data, Knowledge Discovery, Fixed Schema, Computer Science, Database Theory, Rigid Schema, Semistructured Data, Automated Reasoning, Data Extraction, Computational Semantics, Data Modeling
Semistructured data lacks a fixed schema, which makes extraction easy but hinders presentation and querying. We aim to discover the structure implicit in such data and to recast the raw data in terms of that structure. We model semistructured data as labeled directed graphs, type them via the greatest fixpoint semantics of monadic datalog programs, and propose an algorithm for approximate typing. We show that finding an optimal typing is NP-hard, but heuristics based on clustering yield efficient, near-optimal solutions, and preliminary experiments support the approach.
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.
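To make the greatest fixpoint idea concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes a simplified rule form in which each type is defined only by required outgoing edges of the shape (label, target type), and the names `greatest_fixpoint_typing`, `ATOMIC`, `edges`, and `rules` are hypothetical. Every complex object starts as a candidate for every type, and candidates are removed until no rule body fails.

```python
from collections import defaultdict

ATOMIC = "atomic"  # hypothetical name for the type of leaf (atomic) values


def greatest_fixpoint_typing(edges, atomic_nodes, rules):
    """Compute type extents under greatest fixpoint semantics.

    edges: iterable of (source, label, target) triples (the labeled graph).
    atomic_nodes: set of nodes that hold atomic values.
    rules: dict mapping a type name to a set of (label, target_type) pairs;
           an object has the type if, for every pair, it has an outgoing
           edge with that label to some object of the target type.
    This rule form is an assumption made for illustration.
    """
    out = defaultdict(list)
    nodes = set(atomic_nodes)
    for src, label, dst in edges:
        out[src].append((label, dst))
        nodes.update((src, dst))

    complex_nodes = nodes - set(atomic_nodes)
    # Greatest fixpoint: start from the largest possible extent of each type.
    extent = {t: set(complex_nodes) for t in rules}
    extent[ATOMIC] = set(atomic_nodes)

    changed = True
    while changed:
        changed = False
        for type_name, body in rules.items():
            for node in list(extent[type_name]):
                satisfied = all(
                    any(lbl == label and dst in extent.get(target_type, ())
                        for lbl, dst in out[node])
                    for label, target_type in body
                )
                if not satisfied:
                    # Remove the object; this may invalidate other objects.
                    extent[type_name].discard(node)
                    changed = True
    return extent


# A recursive type: a person has a name and a friend who is a person.
edges = [
    ("p1", "name", "n1"), ("p1", "friend", "p2"),
    ("p2", "name", "n2"), ("p2", "friend", "p1"),
    ("p3", "name", "n3"),                          # no friend edge
    ("p4", "name", "n4"), ("p4", "friend", "p3"),  # friend is not a person
]
rules = {"person": {("name", ATOMIC), ("friend", "person")}}
typing = greatest_fixpoint_typing(edges, {"n1", "n2", "n3", "n4"}, rules)
```

In this example `p1` and `p2` refer to each other, so they remain in the extent of `person`, while `p3` and then `p4` are removed. Under a least fixpoint reading the same recursive rule would type no object at all, which is one way to see why the greatest fixpoint is the natural choice for cyclic data.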