Publication · Open Access
Publicly Available Clinical BERT Embeddings
Citations: 1.5K
References: 13
Year: 2019
Venue: Proceedings of the 2nd Clinical Natural Language Processing Workshop (NAACL 2019)
Engineering · Machine Learning · Evidence-based Medicine · BERT Models · Multilingual Pretraining · Corpus Linguistics · Language Processing · Text Mining · Word Embeddings · Natural Language Processing · Computational Linguistics · Biomedical Text Mining · Clinical Case Presentation · Machine Translation · Health Sciences · Health Policy · Clinical Case Report · NLP Task · Publicly Available Clinical · Contextual Word · Pre-trained Models · Medical Language Processing · Retrieval Augmented Generation · Discharge Summaries · Clinical Sciences · Medicine · Linguistics · Health Informatics
Contextual word-embedding models such as ELMo and BERT have markedly improved many NLP tasks, yet they have been minimally explored on specialty corpora like clinical text, and no publicly available pre-trained BERT models exist for the clinical domain. This study develops and releases BERT models tailored to clinical text: a generic clinical model and one trained specifically on discharge summaries. The domain-specific models yield performance gains on 3 of 5 clinical NLP tasks, setting a new state of the art on MedNLI, but underperform on 2 de-identification tasks, which the authors attribute to differences between de-identified source text and synthetically non-de-identified task text.
Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.
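Since the paper's main contribution is a set of publicly released pre-trained models, a short usage sketch may help. The snippet below loads a clinical BERT checkpoint with the Hugging Face transformers library and extracts contextual token embeddings for a sentence. The Hub model identifiers (`emilyalsentzer/Bio_ClinicalBERT` and `emilyalsentzer/Bio_Discharge_Summary_BERT`) are an assumption of this sketch and are not stated on this page.

```python
# Minimal usage sketch (not from the paper itself): loading one of the released
# clinical BERT models via Hugging Face transformers and extracting contextual
# token embeddings. The Hub model IDs below are assumptions, not taken from
# this page's text.
import torch
from transformers import AutoModel, AutoTokenizer

# Generic clinical model; swap in "emilyalsentzer/Bio_Discharge_Summary_BERT"
# for the discharge-summary variant.
model_id = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "The patient was discharged home in stable condition."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per wordpiece token (BERT-base hidden size is 768).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # e.g., torch.Size([1, 12, 768])
```

These are per-token contextual embeddings; for sentence-level tasks such as MedNLI, models like these are typically fine-tuned end to end rather than used as frozen feature extractors.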
[Chart: citations per year]