Concepedia

Publication | Closed Access

SMILES-BERT

Citations: 367

References: 28

Year: 2019

TLDR

Deep learning is rapidly advancing drug discovery, especially molecular property prediction, but the limited labeled data in this domain hampers performance. The authors aim to improve molecular property prediction by leveraging large amounts of unlabeled data through a semi‑supervised model, SMILES‑BERT. SMILES‑BERT is a Transformer‑based model pre‑trained on unlabeled SMILES via a masked‑recovery objective and fine‑tuned for downstream property prediction. SMILES‑BERT surpasses state‑of‑the‑art baselines on three datasets, demonstrating the benefit of unsupervised pre‑training and strong generalization.

Abstract

With the rapid progress of AI in both academia and industry, deep learning has been widely introduced into many areas of drug discovery to accelerate its pace and cut R&D costs. Among the problems in drug discovery, molecular property prediction is one of the most important. Unlike general deep learning applications, molecular property prediction has only a limited amount of labeled data. To address this, deep learning methods have begun to exploit large amounts of unlabeled data to improve prediction performance on small labeled datasets. In this paper, we propose a semi-supervised model named SMILES-BERT, which consists of attention-based Transformer layers. A large-scale unlabeled dataset is used to pre-train the model through a Masked SMILES Recovery task. The pre-trained model can then be easily generalized to different molecular property prediction tasks via fine-tuning. In the experiments, the proposed SMILES-BERT outperforms the state-of-the-art methods on all three datasets, showing the effectiveness of our unsupervised pre-training and the strong generalization capability of the pre-trained model.
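The Masked SMILES Recovery task mentioned in the abstract can be illustrated with a minimal sketch of how training pairs might be built: a SMILES string is split into tokens, a fraction of them is hidden, and the hidden tokens become the recovery targets. This is not the authors' implementation. The regex tokenizer, the `<MASK>` symbol, the 15% mask rate, and the function names `tokenize` and `mask_tokens` are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import random
import re

# Assumed tokenizer: bracket atoms, two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
MASK_TOKEN = "<MASK>"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Hide a fraction of tokens; return the masked sequence and the targets.

    The 15% default is an assumption borrowed from BERT-style pre-training.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model should recover
        masked[pos] = MASK_TOKEN
    return masked, targets


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
tokens = tokenize(smiles)
masked, targets = mask_tokens(tokens, seed=0)
print(" ".join(masked))
print(targets)
```

During pre-training, a model would receive `masked` as input and be trained to predict the entries of `targets` at the masked positions; fine-tuning would then reuse the same encoder with a property prediction head.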
