Deep Dive into Machine Learning Models for Protein Engineering

TLDR

Protein redesign is crucial for drug development, yet the astronomical number of possible mutations makes exhaustive synthesis infeasible. Machine learning now enables virtual screening of candidate sequences, but deep learning models and a broad range of protein descriptors remain underexplored. The study benchmarks prediction models across various machine learning methods and protein descriptor types, including two novel single‑amino‑acid descriptors and one 3D structure‑based descriptor. Models were evaluated on diverse public and proprietary datasets with multiple metrics. Convolutional neural networks using amino‑acid property descriptors proved most broadly applicable to pharmaceutical protein redesign tasks.

Abstract

Protein redesign and engineering have become important tasks in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking the natural evolutionary steps of mutation, selection, and amplification in the laboratory. For any given protein, however, the number of possible mutations is astronomical, so it is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Interest in using machine learning to assist protein redesign has therefore grown, since prediction models can virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored, and only a small selection of protein sequence descriptors has been considered. In this work, we benchmark the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel single amino acid descriptors and one structure-based three-dimensional descriptor. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that convolutional neural network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.
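To make the idea of amino acid property descriptors feeding a convolutional model concrete, the following is a minimal sketch, not the descriptors or architecture used in this work. It assumes two illustrative per-residue properties (the Kyte-Doolittle hydropathy index and a simplified formal charge near neutral pH) and a single hand-written convolution filter with hypothetical, untrained weights. The names `encode` and `conv1d` are made up for this illustration.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified formal charge near neutral pH (assumption: His treated as neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def encode(sequence):
    """Map each residue to a [hydropathy, charge] property vector."""
    return [[HYDROPATHY[aa], CHARGE.get(aa, 0.0)] for aa in sequence]


def conv1d(features, kernel, bias=0.0):
    """Slide one filter along the sequence (no padding), then apply ReLU.

    features: list of per-residue property vectors, shape (length, channels)
    kernel:   filter weights, shape (width, channels)
    """
    width = len(kernel)
    outputs = []
    for start in range(len(features) - width + 1):
        total = bias
        for k in range(width):
            for c in range(len(kernel[k])):
                total += features[start + k][c] * kernel[k][c]
        outputs.append(max(0.0, total))
    return outputs


# Hypothetical five-residue sequence and untrained filter weights.
features = encode("MKVLD")
kernel = [[0.5, 1.0], [0.5, 1.0], [0.5, 1.0]]
activations = conv1d(features, kernel)
```

In a trained model the filter weights would be learned from measured variant data, and many filters would be stacked and pooled before a final regression or classification layer. The point of the sketch is only that property descriptors turn a sequence into a numeric matrix whose local windows a convolution can scan.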
