Concepedia

TLDR

Most current statistical NLP models rely on local features to enable dynamic-programming inference, which limits their ability to capture the long-distance structure common in language. The study proposes Gibbs sampling as a way to incorporate non-local information into statistical NLP models. By replacing Viterbi decoding with simulated annealing in sequence models such as HMMs, CMMs, and CRFs, the authors augment a CRF-based information extraction system with long-distance dependency models that enforce label and template consistency while keeping inference tractable. The technique achieves an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.

Abstract

Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
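To make the decoding idea concrete, here is a minimal Python sketch of annealed Gibbs sampling for sequence labeling. It is an illustration under stated assumptions, not the paper's exact model: `local_score` stands in for the local (e.g. CRF clique) log-score, the token-level label-counting term is a simple stand-in for the paper's label-consistency factor, and all function names and parameters (`gibbs_annealed_decode`, `temp0`, `cooling`, `consistency_weight`) are hypothetical.

```python
import math
import random
from collections import Counter

def gibbs_annealed_decode(tokens, labels, local_score, n_iters=200,
                          temp0=10.0, cooling=0.95,
                          consistency_weight=1.0, seed=0):
    """Annealed Gibbs sampling for sequence labeling (a sketch).

    local_score(tokens, seq, i, y) is assumed to return the local
    log-score of assigning label y at position i given the current
    sequence; a non-local term rewards labeling repeated tokens
    consistently, mimicking a label-consistency constraint.
    """
    rng = random.Random(seed)
    seq = [rng.choice(labels) for _ in tokens]  # random initial state
    temp = temp0
    for _ in range(n_iters):
        for i in range(len(tokens)):
            # Non-local factor: count labels given to other occurrences
            # of the same token elsewhere in the sequence.
            same = [seq[j] for j in range(len(tokens))
                    if j != i and tokens[j] == tokens[i]]
            counts = Counter(same)
            # Annealed conditional: divide log-scores by the temperature,
            # so the distribution sharpens toward its mode as temp -> 0.
            logp = [(local_score(tokens, seq, i, y)
                     + consistency_weight * counts.get(y, 0)) / temp
                    for y in labels]
            m = max(logp)  # subtract max before exp for stability
            probs = [math.exp(lp - m) for lp in logp]
            z = sum(probs)
            # Resample position i from its annealed conditional.
            r, acc = rng.random() * z, 0.0
            for y, p in zip(labels, probs):
                acc += p
                if r <= acc:
                    seq[i] = y
                    break
        temp *= cooling  # cooling schedule
    return seq

# Toy usage with a hypothetical scorer (illustration only): prefer
# "PER" for capitalized tokens, "O" otherwise.
def toy_score(tokens, seq, i, y):
    return 1.0 if (y == "PER") == tokens[i][0].isupper() else 0.0

print(gibbs_annealed_decode("Smith met Smith in Paris".split(),
                            ["PER", "O"], toy_score))
```

The key design point mirrors the abstract: each Gibbs update conditions on the entire current label sequence, so non-local factors such as the consistency count can influence every position, while the cooling schedule gradually turns the sampler into a mode-seeking decoder in place of Viterbi.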
