Automated alphabet reduction method with evolutionary algorithms for protein structure prediction

Abstract

This paper focuses on automated procedures to reduce the dimensionality of protein structure prediction datasets by simplifying the way in which the primary sequence of a protein is represented. The potential benefits of this procedure are a faster and easier learning process as well as the generation of more compact and human-readable classifiers. The dimensionality reduction procedure we propose consists of reducing the 20-letter amino acid (AA) alphabet, which is normally used to specify a protein sequence, to a lower-cardinality alphabet. This reduction is achieved by clustering AA types according to their physical and chemical similarity. Our automated reduction procedure is guided by a fitness function based on the Mutual Information between the AA-based input attributes of the dataset and the protein structure feature that is being predicted. To search for the optimal reduction, the Extended Compact Genetic Algorithm (ECGA) was used, and afterwards the results of this process were fed into (and validated by) BioHEL, a genetics-based machine learning technique. BioHEL used the reduced alphabet to induce rules for protein structure prediction features. BioHEL results are compared to two standard machine learning systems. Our results show that it is possible to reduce the size of the alphabet used for prediction from twenty to just three letters, resulting in more compact, i.e. more interpretable, rules. Also, a protein-wise accuracy performance measure suggests that the loss of accuracy accrued by this substantial alphabet reduction is not statistically significant when compared to the full alphabet.
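As an illustration of the kind of fitness function described above, the following minimal Python sketch maps a sequence onto a reduced alphabet and estimates the Mutual Information between one reduced-alphabet attribute and a class label. It is a sketch under stated assumptions, not the method used in the paper: the three-group mapping `EXAMPLE_GROUPS`, the function names, and the use of a single attribute (rather than a window of residues) are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical 3-letter grouping of the 20 amino acids, for illustration only.
# It is NOT the reduction reported in the paper.
EXAMPLE_GROUPS = {
    "a": "AVLIMCFWY",
    "b": "GPSTNQH",
    "c": "DEKR",
}


def build_mapping(groups):
    """Invert {letter: residues} into {residue: letter}."""
    mapping = {}
    for letter, residues in groups.items():
        for aa in residues:
            mapping[aa] = letter
    return mapping


def reduce_sequence(seq, mapping):
    """Rewrite a primary sequence using the reduced alphabet."""
    return "".join(mapping[aa] for aa in seq)


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two paired label lists."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

A candidate reduction could then be scored by reducing every residue in the dataset with `reduce_sequence` and passing the reduced letters, paired with the structural feature to be predicted, to `mutual_information`; a search procedure such as an evolutionary algorithm would compare these scores across candidate groupings.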
