Learned protein embeddings for machine learning

TLDR

Machine-learning models trained on protein sequences can predict biological properties of unseen sequences without explicit mechanistic understanding, which enables the discovery of sequences with optimal properties. Their performance, however, depends on how sequences are converted into vector representations. The authors learn low-dimensional embeddings from large corpora of unlabeled protein sequences and use the resulting vectors as inputs to downstream models. The embeddings simplify downstream modeling, achieve predictive performance comparable to existing representations, require no alignments or structural data, and capture meaningful relationships among proteins. Embedding vectors and code are available at https://github.com/fhalab/embeddings_reproduction/, and supplementary data are available at Bioinformatics online.

Abstract

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured.

Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/.

Supplementary information: Supplementary data are available at Bioinformatics online.
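To make the dimensionality point concrete, the sketch below converts a protein sequence into a vector using one-hot encoding, a common baseline representation. It is an illustration only: one-hot encoding is assumed here as an example of an "existing representation" and may not be the exact baseline the authors used, and the names `one_hot_encode`, `AMINO_ACIDS`, and `toy_sequence` are hypothetical.

```python
# Illustrative sketch only: one-hot encoding is assumed as a typical baseline,
# not taken from the authors' code. All names here are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Flatten a protein sequence into a binary vector of length 20 * len(sequence)."""
    n_letters = len(AMINO_ACIDS)
    vector = [0] * (n_letters * len(sequence))
    for position, residue in enumerate(sequence):
        if residue not in AA_INDEX:
            raise ValueError(f"Unsupported residue {residue!r} at position {position}")
        # Each position owns a block of 20 slots; exactly one slot is set to 1.
        vector[position * n_letters + AA_INDEX[residue]] = 1
    return vector


toy_sequence = "MKTAYIAKQR"  # hypothetical 10-residue fragment
encoded = one_hot_encode(toy_sequence)
print(len(encoded))  # 200 = 20 * 10
print(sum(encoded))  # 10, one active entry per residue
```

With this scheme the vector length grows linearly with sequence length, so a protein of a few hundred residues already yields thousands of mostly zero dimensions, and sequences of different lengths produce vectors of different sizes. A learned embedding, by contrast, maps every sequence to a vector of the same small, fixed size.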
