Recovering semantics of tables on the web

TLDR

The Web contains over 100 million tables, but their semantics are rarely explicit: header rows are uncommon, and when present the attribute names are typically uninformative. The study presents a system that recovers table semantics by enriching each table with additional annotations, drawing on a database of class labels and relationships automatically extracted from the Web that has wide coverage but is noisy. The system attaches a class label to a column when enough of the column's values match that label in the database, and handles binary relationships analogously. A formal model decides when the evidence for a label is sufficient, and it performs substantially better than a simple majority scheme. The annotations support table search and finding related tables; in table-search experiments the recovered semantics perform substantially better than previous approaches. The study also characterizes what fraction of Web tables can be annotated this way.

Abstract

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
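
The column-labeling idea can be illustrated with a minimal sketch of a majority-style rule, the kind of baseline the abstract compares against, not the formal model itself. Everything named here is an assumption made for illustration: the toy `ISA_DB` mapping, the `label_column` helper, and the `min_fraction` cutoff do not come from the paper.

```python
from collections import Counter

# Hypothetical toy database mapping a cell value to its class labels.
# A Web-extracted database would be far larger and noisier.
ISA_DB = {
    "paris": {"city", "capital"},
    "london": {"city", "capital"},
    "chicago": {"city"},
    "texas": {"state"},
}


def label_column(values, isa_db, min_fraction=0.5):
    """Return the labels held by more than min_fraction of the column's cells.

    This is an assumed majority-style rule, not the paper's formal model.
    """
    counts = Counter()
    for value in values:
        # Normalize the cell text before looking it up.
        for label in isa_db.get(value.strip().lower(), ()):
            counts[label] += 1
    total = len(values)
    if total == 0:
        return []
    return sorted(label for label, n in counts.items() if n / total > min_fraction)


column = ["Paris", "London", "Chicago", "Texas"]
print(label_column(column, ISA_DB))
print(label_column(column, ISA_DB, min_fraction=0.4))
```

With the default cutoff only `city` qualifies, because three of the four cells carry it, while `capital` sits exactly at one half and is rejected. A fixed cutoff like this treats every match as equally trustworthy, which is one reason a noisy database motivates a more careful evidence model.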
