Concepedia

TLDR

Genomic selection (GS) has been shown by simulation and empirical studies to achieve accuracies sufficient for rapid genetic gains, yet the proliferation of GS models has left no comparative analysis to identify the most promising ones. This study evaluated the predictive ability of existing GS models and several machine learning methods across eight diverse plant datasets (wheat, barley, Arabidopsis, and maize), comparing model accuracies, genomic estimated breeding values, and marker effect estimates. Many models achieved comparable accuracy, but they varied widely in overfitting, computation time, and the distribution of marker effect estimates. The authors therefore recommend a reduced set of models: Bayesian Lasso, weighted Bayesian shrinkage regression, and random forest. Linear combinations of models, as well as bagging and boosting, did not improve accuracy, and differences in accuracy between subpopulations could not always be explained by phenotypic variance or population size. Overall, the results support GS's potential to increase genetic gain per unit time and cost.

Abstract

Simulation and empirical studies of genomic selection (GS) show accuracies sufficient to generate rapid genetic gains. However, with the increased popularity of GS approaches, numerous models have been proposed and no comparative analysis is available to identify the most promising ones. Using eight wheat (Triticum aestivum L.), barley (Hordeum vulgare L.), Arabidopsis thaliana (L.) Heynh., and maize (Zea mays L.) datasets, the predictive ability of currently available GS models along with several machine learning methods was evaluated by comparing accuracies, the genomic estimated breeding values (GEBVs), and the marker effects for each model. While a similar level of accuracy was observed for many models, the level of overfitting varied widely as did the computation time and the distribution of marker effect estimates. Our comparisons suggested that GS in plant breeding programs could be based on a reduced set of models such as the Bayesian Lasso, weighted Bayesian shrinkage regression (wBSR, a fast version of BayesB), and random forest (RF) (a machine learning method that could capture nonadditive effects). Linear combinations of different models were tested as well as bagging and boosting methods, but they did not improve accuracy. This study also showed large differences in accuracy between subpopulations within a dataset that could not always be explained by differences in phenotypic variance and size. The broad diversity of empirical datasets tested here adds evidence that GS could increase genetic gain per unit of time and cost.
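The core computation behind the comparison in the abstract, predicting GEBVs from genome-wide markers and measuring predictive ability as the correlation between predicted and observed values in a validation set, can be sketched in a few lines. The following is a minimal illustrative example, not the study's actual pipeline: it simulates a biallelic marker matrix with a sparse additive architecture and fits a ridge regression (RR-BLUP-style shrinkage, a simpler relative of the Bayesian models compared in the paper). All dimensions, the penalty `lam`, and the simulated genetic architecture are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated biallelic genotypes coded 0/1/2: 300 lines x 150 markers (illustrative sizes).
n_lines, n_markers = 300, 150
X = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)

# Sparse additive architecture: 20 markers carry nonzero effects; the rest are noise.
beta = np.zeros(n_markers)
qtl = rng.choice(n_markers, size=20, replace=False)
beta[qtl] = rng.normal(0.0, 1.0, size=20)
y = X @ beta + rng.normal(0.0, 2.0, size=n_lines)  # phenotype = genetic value + error

# Center markers; split lines into a training and a validation set.
Xc = X - X.mean(axis=0)
train, test = np.arange(240), np.arange(240, n_lines)

# Ridge regression of phenotype on markers: effects = (X'X + lam*I)^-1 X'y.
# All marker effects are shrunk equally, unlike BayesB/wBSR which shrink adaptively.
lam = 10.0  # assumed penalty; in practice tuned or derived from variance components
A = Xc[train].T @ Xc[train] + lam * np.eye(n_markers)
effects = np.linalg.solve(A, Xc[train].T @ (y[train] - y[train].mean()))

# GEBV = sum of estimated marker effects; predictive ability = cor(GEBV, phenotype).
gebv = Xc[test] @ effects
r = np.corrcoef(gebv, y[test])[0, 1]
print(f"predictive ability (validation): {r:.3f}")
```

Swapping the ridge solve for a Bayesian Lasso sampler or a random forest, as the study does, changes only the effect estimation step; the cross-validated correlation used to score each model stays the same.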
