Publicly Available Clinical BERT Embeddings

TLDR

Contextual word embeddings such as ELMo and BERT have dramatically improved NLP performance, yet they have been little explored on clinical text, and no pre-trained clinical BERT models have been publicly available. This study trains BERT on large clinical corpora and releases two models to fill this gap: one for generic clinical text and one for discharge summaries specifically. The domain-specific models improve performance on three common clinical NLP tasks compared with nonspecific embeddings, but they underperform on two de-identification tasks, likely because the de-identified source text differs from the synthetically non-de-identified task text.

Abstract

Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. We find that these domain-specific models are not as performant on two clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non-de-identified task text.
