Publication | Open Access
A strategy to apply machine learning to small datasets in materials science
Citations: 694 · References: 54 · Year: 2018
Topics: Engineering, Machine Learning, Machine Learning Tool, Material Simulation, Computational Nanostructure Modeling, Precision–DoF Association, Small Datasets, Higher DoF, Data Science, Data Mining, Pattern Recognition, Materials Science, Crystalline Defects, Machine Learning Model, Predictive Analytics, Knowledge Discovery, Computer Science, Computational Modeling, Data Classification, Binary Semiconductors, Materials Characterization, Applied Physics, Material Modeling, Classifier System
Interest in applying machine learning to materials science is growing, yet the impact of small, diverse datasets on model performance has not been studied. The study investigates how data availability influences the accuracy of machine learning models and proposes adding crude property estimates to the feature space to enable reliable models with limited data. The authors show that data size affects model precision indirectly, through the model's degree of freedom (DoF), and that incorporating crude estimates increases accuracy without requiring a higher DoF, which mitigates underfitting. Applying this strategy to predictions of the band gap of binary semiconductors, lattice thermal conductivity, and the elastic properties of zeolites raised accuracy to state-of-the-art levels, supporting its general applicability.
Abstract: There is growing interest in applying machine learning techniques in materials science research. However, although it is recognized that materials datasets are typically smaller and sometimes more diverse than those in other fields, the influence of the availability of materials data on training machine learning models has not yet been studied, which makes it difficult to establish accurate predictive rules from small materials datasets. Here we analyzed the fundamental interplay between the availability of materials data and the predictive capability of machine learning models. Instead of affecting the model precision directly, the effect of data size is mediated by the degree of freedom (DoF) of the model, resulting in an association between precision and DoF. The appearance of this precision–DoF association signals underfitting and is characterized by a large prediction bias, which in turn restricts accurate prediction in unknown domains. We proposed incorporating a crude estimate of the property into the feature space to establish ML models from small materials datasets, which increases the accuracy of prediction without the cost of a higher DoF. In three case studies, predicting the band gap of binary semiconductors, lattice thermal conductivity, and the elastic properties of zeolites, the integration of the crude estimate effectively boosted the predictive capability of machine learning models to state-of-the-art levels, demonstrating the generality of the proposed strategy for constructing accurate machine learning models from small materials datasets.
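The idea of adding a crude estimate of the target property to the feature space can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic "true" property, the biased and noisy crude estimate, the 12-sample training set, and the plain linear least-squares model (with helper names `solve`, `fit_linear`, `make_data` that are hypothetical). It only shows how a simple model trained on few samples may predict better once a cheap, imperfect estimate is supplied as an extra feature.

```python
import math
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(xs, targets)) for i in range(p)]
    return solve(xtx, xty)

def predict(w, row):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def rmse(w, rows, targets):
    sq = sum((predict(w, r) - t) ** 2 for r, t in zip(rows, targets))
    return math.sqrt(sq / len(rows))

def make_data(n, rng):
    """Synthetic data: two descriptors, a nonlinear target, a crude estimate."""
    base, augmented, targets = [], [], []
    for _ in range(n):
        x1, x2 = rng.uniform(0, 2), rng.uniform(0, 2)
        y = math.exp(x1) + 0.5 * x2  # hypothetical "true" property
        # Hypothetical cheap estimate: systematically biased and noisy.
        crude = 0.8 * math.exp(x1) + 0.3 + rng.gauss(0, 0.05)
        base.append([x1, x2])
        augmented.append([x1, x2, crude])
        targets.append(y)
    return base, augmented, targets

rng = random.Random(0)  # fixed seed so the run is reproducible
train_base, train_aug, train_y = make_data(12, rng)   # small training set
test_base, test_aug, test_y = make_data(500, rng)     # held-out test set

rmse_without = rmse(fit_linear(train_base, train_y), test_base, test_y)
rmse_with = rmse(fit_linear(train_aug, train_y), test_aug, test_y)
print(f"test RMSE without crude estimate: {rmse_without:.3f}")
print(f"test RMSE with crude estimate:    {rmse_with:.3f}")
```

In this toy setup the linear model cannot represent the exponential dependence from the raw descriptors alone, so it underfits. The crude estimate carries that dependence, letting the same simple model class fit the target closely. The DoF analysis and the models used in the paper itself are not reproduced here.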