Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructures and Structure Activity Relationships of Noncongeneric Compounds

TLDR

The study investigates the use of data mining and machine learning to derive mutagenicity structure-activity relationships (SARs) from noncongeneric data sets. The authors compare MOLFEA, a newly developed algorithm that generates molecular-fragment descriptors, with traditional SAR approaches based on molecular properties. They also evaluate several machine learning algorithms for model induction, investigate optimal parameter settings, and give an exemplary interpretation of the derived models. Models using MOLFEA-derived descriptors were approximately 10-15 percentage points more accurate than models using molecular properties alone, and combining both descriptor types gave no further improvement. The rule learner PART and support vector machines performed best, though only marginally better than the other learners, with predictive accuracies of up to 78% in 10-fold cross-validation. The resulting models are relatively easy to interpret and usable for both prediction and explanation.

Abstract

This paper explores the utility of data mining and machine learning algorithms for the induction of mutagenicity structure-activity relationships (SARs) from noncongeneric data sets. We compare (i) a newly developed algorithm (MOLFEA) for the generation of descriptors (molecular fragments) for noncongeneric compounds with traditional SAR approaches (molecular properties) and (ii) different machine learning algorithms for the induction of SARs from these descriptors. In addition, we investigate the optimal parameter settings for these programs and give an exemplary interpretation of the derived models. The predictive accuracies of models using MOLFEA-derived descriptors are approximately 10-15 percentage points higher than those of models using molecular properties alone. Using both types of descriptors together does not improve the derived models. Of the applied machine learning techniques, the rule learner PART and support vector machines gave the best results, although the differences between the learning algorithms are only marginal. We were able to achieve predictive accuracies of up to 78% in 10-fold cross-validation. The resulting models are relatively easy to interpret and usable for predictive as well as explanatory purposes.
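The 10-fold cross-validation estimate mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the binary fragment-presence encoding, the function names, and the one-fragment toy learner (standing in for PART or a support vector machine) are all assumptions made for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical encoding: one 0/1 entry per molecular fragment (1 = fragment present).
Features = Tuple[int, ...]
# A model maps a feature vector to a class label (1 = mutagenic, 0 = non-mutagenic).
Model = Callable[[Features], int]
Trainer = Callable[[Sequence[Features], Sequence[int]], Model]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_best_fragment(X: Sequence[Features], y: Sequence[int]) -> Model:
    """Toy learner (an assumption, not PART or an SVM): choose the single
    fragment whose presence agrees most often with the class label."""
    best_j, best_hits = 0, -1
    for j in range(len(X[0])):
        hits = sum(1 for xi, yi in zip(X, y) if xi[j] == yi)
        if hits > best_hits:
            best_j, best_hits = j, hits
    return lambda x: x[best_j]


def cross_val_accuracy(X: Sequence[Features], y: Sequence[int],
                       train: Trainer, k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the
    correct predictions over all k folds into one accuracy figure."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = train(train_X, train_y)
        correct += sum(1 for i in test_idx if model(X[i]) == y[i])
    return correct / len(X)
```

Calling `cross_val_accuracy(X, y, train_best_fragment)` on a list of fragment vectors `X` and labels `y` returns the fraction of compounds classified correctly while held out. Comparing two descriptor sets on the same compounds then amounts to comparing two such fractions, which is why the difference is naturally stated in percentage points.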
