Publication | Closed Access
SciBERT: Pretrained Contextualized Embeddings for Scientific Text
Citations: 178
References: 0
Year: 2019
Unknown Venue
LLM Fine-tuning, Engineering, Sequence Tagging, Multilingual Pretraining, Corpus Linguistics, Language Processing, Text Mining, Word Embeddings, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Data Resources, Scientific Domain, Corpus Analysis, Language Studies, Machine Translation, NLP Task, Scientific Text, Pretrained Language Model, Pre-trained Models, Medical Language Processing, Semantic Parsing, Retrieval Augmented Generation, Linguistics, POS Tagging
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT, to address the lack of high-quality, large-scale labeled scientific data. SciBERT is pretrained on a large multi-domain scientific corpus and evaluated on sequence tagging, sentence classification, and dependency parsing tasks, with code and models publicly available. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at this https URL.