Publication | Closed Access
The use of web-based statistics to validate information extraction
Citations: 16
References: 9
Year: 2004
Unknown Venue
Engineering, Semantic Web, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Computational Linguistics, Language Studies, Content Analysis, Named-entity Recognition, Statistics, Pointwise Mutual Information, Knowledge Discovery, Webometrics, Information Extraction, Relationship Extraction, Keyword Extraction, Data Extraction, Negative Seeds
The World Wide Web is a powerful and readily available text corpus that can be used effectively to validate the output of an information extraction system. We present experiments that explore how pointwise mutual information (PMI) from search engine hit counts can be used in an Assessor module that assigns a probability that an extracted fact or relationship is correct, thus boosting precision. We find that thresholding on PMI scores is more effective in creating features for the Assessor than using probability density models. Bootstrapping can be effective in finding both positive and negative seeds to train the Assessor, performing better than hand-tagging a sample of actual extractions.
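As a rough illustration of the idea in the abstract, the sketch below estimates PMI from hit counts and turns the score into a thresholded boolean feature. It uses the textbook definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with each probability approximated as hits divided by an index size; the paper's exact formulation is not given here and may differ. The function names, the `total_pages` parameter, and the threshold value are hypothetical, and no real search engine API is called.

```python
import math


def pmi_from_hits(hits_xy, hits_x, hits_y, total_pages):
    """Estimate PMI(x, y) in bits from raw hit counts.

    Assumption: p(.) is approximated as hits / total_pages, where
    total_pages is a hypothetical estimate of the index size.
    """
    if hits_xy == 0 or hits_x == 0 or hits_y == 0:
        # No co-occurrence evidence: treat as the lowest possible score.
        return float("-inf")
    return math.log2((hits_xy * total_pages) / (hits_x * hits_y))


def threshold_feature(score, threshold):
    """Convert a PMI score into a boolean feature for a classifier."""
    return score >= threshold


# Made-up counts for an extracted instance x and a discriminator phrase y.
score = pmi_from_hits(hits_xy=400, hits_x=1000, hits_y=1000, total_pages=10000)
feature = threshold_feature(score, threshold=1.0)  # threshold is illustrative
```

In a setup like the one the abstract describes, such a threshold would presumably be chosen from positive and negative seed examples rather than fixed by hand, and the resulting boolean features would feed the Assessor's probability estimate.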