Publication | Closed Access
Unbiased Recursive Partitioning: A Conditional Inference Framework
4K Citations | 56 References | Year: 2006
Recursive Binary Partitioning, Engineering, Inductive Inference, Bayesian Inference, Data Science, Data Mining, Decision Tree, Recursive Partitioning, Decision Tree Learning, Statistics, Prediction Modelling, Predictive Analytics, Knowledge Discovery, Statistical Learning Theory, Variable Selection Bias, Conditional Inference Framework, Automated Reasoning, Rule Induction, Statistical Inference
Recursive binary partitioning is widely used for regression, but the exhaustive search procedures usually applied to fit such models suffer from overfitting and from a selection bias towards covariates with many possible splits or missing values, and existing unbiased procedures lack a common theoretical foundation. The authors introduce a unified framework that embeds tree-structured regression models in a well-defined theory of conditional inference procedures, with stopping criteria based on multiple test procedures. The framework applies to all kinds of regression problems (nominal, ordinal, numeric, censored and multivariate responses) and is demonstrated on glaucoma classification, node-positive breast cancer survival and mammography data. The resulting trees match the predictive performance of exhaustive search procedures while inducing structurally different partitions, and trees with early stopping predict as accurately as pruned trees with unbiased variable selection, confirming the need for unbiased variable selection.
Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of the exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of tree-structured regression models. For some special cases unbiased procedures have been suggested, albeit without a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well-defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented, and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions, and therefore the models, induced by the two approaches are structurally different, confirming the need for an unbiased variable selection. Moreover, it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection. The methodology presented here is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables, and arbitrary measurement scales of the covariates. Data from studies on glaucoma classification, node-positive breast cancer survival and mammography experience are re-analyzed.