Publication | Closed Access
Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties
Citations: 111
References: 41
Year: 2023
Environmental Monitoring, Machine Learning, Engineering, Machine Learning Tool, Water Quality Forecasting, Binary Morgan Fingerprint, Environmental Chemistry, Data Science, Data Mining, Pattern Recognition, Interpretable Molecular Representation, Predictive Toxicology, Predictive Analytics, Knowledge Discovery, Chemometric Method, Water Quality, Computer Science, Bioinformatics, Target Prediction, Water Analysis, Environmental Engineering, Molecular Property, Computational Biology, Count-based Morgan Fingerprint
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent the chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Whereas the binary Morgan fingerprint (B-MF) only records the presence or absence of an atom group, C-MF also quantifies how many times that group occurs in a molecule. We employ six ML algorithms (ridge regression, support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets using both C-MF and B-MF, and compare the two representations in terms of model predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of the 10 data sets in terms of predictive performance. The advantage of C-MF over B-MF depends on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of the data sets as calculated by B-MF and C-MF. Model interpretation results show that C-MF-based models can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed the "ContaminaNET" platform to deploy these C-MF-based models for free use.
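The difference between the two representations can be illustrated with a minimal Python sketch. It is a toy example, not the authors' implementation: the integer atom-group identifiers, the 16-position vector length, and the helper names `fold_counts`, `to_binary`, and `tanimoto` are all assumptions made for illustration. It shows two hypothetical molecules that contain the same atom groups but in different numbers, so they are indistinguishable under a binary encoding yet distinct under a count encoding.

```python
def fold_counts(group_ids, n_positions=16):
    """Fold atom-group identifiers into a fixed-length count vector."""
    vec = [0] * n_positions
    for gid in group_ids:
        vec[gid % n_positions] += 1  # each occurrence adds one to its position
    return vec


def to_binary(count_vec):
    """Collapse counts to presence/absence (the binary-fingerprint view)."""
    return [1 if c > 0 else 0 for c in count_vec]


def tanimoto(a, b):
    """Min-max Tanimoto similarity; equals the usual Tanimoto on 0/1 vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0


# Hypothetical atom-group identifiers (not taken from any real molecule).
mol_a = [3, 3, 3, 7, 12]  # group 3 occurs three times
mol_b = [3, 7, 12]        # same groups, each occurring once

count_a, count_b = fold_counts(mol_a), fold_counts(mol_b)
binary_a, binary_b = to_binary(count_a), to_binary(count_b)

sim_binary = tanimoto(binary_a, binary_b)  # binary vectors are identical
sim_count = tanimoto(count_a, count_b)     # counts differ at position 3

print(f"binary similarity: {sim_binary:.2f}")
print(f"count similarity:  {sim_count:.2f}")
```

In this toy case the binary vectors coincide while the count vectors do not, which is one way to see why chemical diversity measured with the two fingerprints can differ. In practice, fingerprints of this kind are usually generated with a cheminformatics toolkit such as RDKit rather than by hand.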