Improving wheat yield prediction through variable selection using Support Vector Regression, Random Forest, and Extreme Gradient Boosting

Abstract

• A feature selection process, based on correlation analysis and principal component analysis (PCA), was conducted to identify the most significant factors influencing wheat yield prediction.
• Numerous experiments evaluated different combinations of variables in the training set using support vector regression (SVR), random forest (RF), and extreme gradient boosting (XGBoost) models.
• Climatic variables such as precipitation rate, rainfall, and cooling and heating degree-days can influence wheat yield, but their impact is less pronounced in irrigated systems. Monitoring these indicators is nevertheless recommended to better understand crop conditions.
• Normalized difference vegetation index (NDVI), biomass (BM), and harvest index (HI) are proposed as the most suitable indices for improving the accuracy of wheat yield prediction.

Plant breeding centers, in their pursuit of more productive and resilient wheat varieties, have generated vast data repositories that are fundamental to ensuring global food security. This study uses these data to develop a wheat grain yield (GY) prediction model with three machine learning techniques: Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost). The results demonstrate the potential of RF- and XGBoost-based models to accurately predict wheat yield. One of the major challenges of this research was to find the most relevant variables for predicting wheat yield. Clustering, feature selection, and variable combination techniques showed that agronomic variables, particularly harvest index (HI) and biomass (BM), provide complementary information to the Normalized Difference Vegetation Index (NDVI). This combination, analyzed with the XGBoost model, achieved strong performance, with an RMSE of 28.5082 g/m² and an R² of 0.9156, showing that these indicators complement one another. Further analysis showed that daily clustering and filtering of climatic variables, especially precipitation rate, were favorable in these types of models.
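As an illustration of the quantities named above, the following is a minimal sketch, not the authors' pipeline, of a correlation-based variable screening step and of the two reported metrics, RMSE and R². It uses only the Python standard library. The function names (`pearson_r`, `rank_features`, `rmse`, `r2_score`) and all numeric values in the demo are hypothetical and chosen purely for illustration; the study's actual models (SVR, RF, XGBoost) and its PCA step are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Rank candidate variables by absolute correlation with the target."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (e.g. g/m^2)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical plot-level values, invented for this sketch only.
    grain_yield = [420.0, 510.0, 465.0, 580.0, 390.0, 545.0]  # g/m^2
    candidates = {
        "NDVI": [0.61, 0.72, 0.66, 0.80, 0.58, 0.76],
        "BM": [1050.0, 1240.0, 1150.0, 1400.0, 1000.0, 1330.0],
        "precipitation_rate": [2.1, 1.8, 2.5, 2.0, 2.3, 1.9],
    }
    for name, r in rank_features(candidates, grain_yield):
        print(f"{name}: r = {r:+.3f}")

    # Hypothetical model predictions for the same plots.
    predicted = [430.0, 500.0, 470.0, 560.0, 400.0, 550.0]
    print(f"RMSE = {rmse(grain_yield, predicted):.4f} g/m^2")
    print(f"R2   = {r2_score(grain_yield, predicted):.4f}")
```

In practice, a screening step of this kind would be applied to the training set only, and the retained variables would then be passed to the regression models for comparison on held-out data.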
