Publication | Open Access
Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language
Citations: 16
References: 31
Year: 2024
Venue: Unknown
Abstract: The language of biology, encoded in DNA, RNA, and proteins, forms the foundation of life but remains challenging to decode due to its complexity. Traditional computational methods often struggle to integrate information across these molecules, limiting a comprehensive understanding of biological systems. Advances in Natural Language Processing (NLP) with pre-trained models offer new possibilities for interpreting biological language. Here, we introduce LucaOne, a pre-trained foundation model trained on nucleic acid and protein sequences from 169,861 species. Through large-scale data integration and semi-supervised learning, LucaOne demonstrates an understanding of key biological principles, such as DNA-to-protein translation. Using few-shot learning, it effectively comprehends the central dogma of molecular biology and performs competitively on tasks involving DNA, RNA, or protein inputs. Our results highlight the potential of unified foundation models to address complex biological questions, providing an adaptable framework for bioinformatics research and enhancing the interpretation of life’s complexity.