Distributional Semantics Resources for Biomedical Text Processing

TLDR

The openly available biomedical literature contains over 5 billion words across publication abstracts and full texts. Recent advances in unsupervised language processing make it possible to use such large unannotated corpora to build statistical language models and induce vector-space representations, which in turn support tasks such as text classification, named-entity recognition, and query expansion. The study introduces the first set of such language resources derived from the entire available biomedical literature: a dataset of all 1- to 5-grams with their probabilities, and new models of word semantics. The authors discuss the opportunities these resources create and demonstrate their application. All resources introduced in the study are available under open licenses at http://bio.nlplab.org.

Abstract

The openly available biomedical literature contains over 5 billion words in publication abstracts and full texts. Recent advances in unsupervised language processing methods have made it possible to use such large unannotated corpora for building statistical language models and inducing high-quality vector space representations, which are, in turn, useful in many tasks such as text classification, named entity recognition, and query expansion. In this study, we introduce the first set of such language resources created from analysis of the entire available biomedical literature, including a dataset of all 1- to 5-grams and their probabilities in these texts, and new models of word semantics. We discuss the opportunities created by these resources and demonstrate their application. All resources introduced in this study are available under open licenses at http://bio.nlplab.org.
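
To make the idea of an n-gram dataset with probabilities concrete, the following is a minimal sketch that counts all 1- to 5-grams in a tokenized text and converts the counts to relative frequencies. It is an illustration only: the function names, the whitespace tokenization, the toy sentence, and the unsmoothed relative-frequency estimate are all assumptions, and the abstract does not state how the released probabilities were actually estimated.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def relative_frequencies(counter):
    """Turn raw counts into unsmoothed relative frequencies (an assumed estimator)."""
    total = sum(counter.values())
    return {gram: count / total for gram, count in counter.items()}


# Toy input (hypothetical); a real corpus would need proper tokenization.
tokens = "the protein binds the receptor and the protein is degraded".split()
counts = ngram_counts(tokens)
bigram_probs = relative_frequencies(counts[2])
```

Here `bigram_probs[("the", "protein")]` is the share of all bigrams in the toy text that are "the protein". At the scale of billions of words, a practical pipeline would stream the corpus and typically apply smoothing rather than hold every count in memory as this sketch does.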
