Estimating PM₂.₅ Concentrations in the Conterminous United States Using the Random Forest Approach

TLDR

Many parametric regression models have been developed to estimate PM₂.₅, but nonparametric machine-learning algorithms are used less often and national-scale models are rare. The study develops a random forest model, an ensemble learning method with high accuracy and interpretability, that uses aerosol optical depth (AOD), meteorological fields, and land-use variables to estimate daily 24-hour average ground-level PM₂.₅ across the conterminous United States in 2011. Cross-validation yielded an R² of 0.80, a mean prediction error of 1.78 µg/m³, and an RMSPE of 2.83 µg/m³, comparable to prior neural-network and regression models. Adding convolutional layers for land-use terms and for nearby PM₂.₅ measurements improved the cross-validation R² by ~0.02 and ~0.06, respectively, and two variable importance measures rank the convolutional layer for nearby PM₂.₅ measurements and AOD among the most important predictors.

Abstract

To estimate PM₂.₅ concentrations, many parametric regression models have been developed, while nonparametric machine learning algorithms are used less often and national-scale models are rare. In this paper, we develop a random forest model incorporating aerosol optical depth (AOD) data, meteorological fields, and land use variables to estimate daily 24 h averaged ground-level PM₂.₅ concentrations over the conterminous United States in 2011. Random forests are an ensemble learning method that provides predictions with high accuracy and interpretability. Our results achieve an overall cross-validation (CV) R² value of 0.80. Mean prediction error (MPE) and root mean squared prediction error (RMSPE) for daily predictions are 1.78 and 2.83 μg/m³, respectively, indicating good agreement between CV predictions and observations. The prediction accuracy of our model is similar to that reported in previous studies using neural networks or regression models on both national and regional scales. In addition, the incorporation of convolutional layers for land use terms and nearby PM₂.₅ measurements increases CV R² by ∼0.02 and ∼0.06, respectively, indicating their significant contributions to prediction accuracy. Two different variable importance measures both indicate that the convolutional layer for nearby PM₂.₅ measurements and AOD values are among the most important predictor variables for the training process.
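
The three accuracy figures quoted above (CV R², MPE, RMSPE) can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not confirm: R² is taken as the coefficient of determination 1 − SS_res/SS_tot, and MPE as the mean absolute difference between predictions and observations. The function name `cv_metrics` and the toy values are hypothetical and are not data from the study.

```python
import math


def cv_metrics(observed, predicted):
    """Return (R2, MPE, RMSPE) for paired observations and held-out predictions.

    Assumed definitions (not confirmed by the abstract):
      R2    = 1 - SS_res / SS_tot
      MPE   = mean absolute prediction error
      RMSPE = square root of the mean squared prediction error
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")

    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")

    r2 = 1.0 - ss_res / ss_tot
    mpe = sum(abs(r) for r in residuals) / n
    rmspe = math.sqrt(ss_res / n)
    return r2, mpe, rmspe


# Toy daily PM2.5 values in ug/m3 (illustrative only, not from the paper).
toy_observed = [10.0, 12.0, 8.0, 14.0]
toy_predicted = [11.0, 11.0, 9.0, 13.0]
r2, mpe, rmspe = cv_metrics(toy_observed, toy_predicted)
print(f"R2={r2:.2f}  MPE={mpe:.2f}  RMSPE={rmspe:.2f}")
```

In a cross-validation setting, `predicted` would hold the predictions made for each observation while it was in the held-out fold, so that the metrics reflect out-of-sample accuracy rather than fit to the training data.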
