Unsupervised Learning With Random Forest Predictors

TLDR

A random forest (RF) predictor naturally yields a dissimilarity measure between observations. The measure extends to unlabeled data by training the forest to distinguish the observed samples from synthetic ones drawn from a reference distribution. The study characterizes the RF dissimilarity and gives practical recommendations for its use. The dissimilarity handles mixed variable types, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. Its intrinsic variable selection, which can weight each variable by its dependence on the others, lets it cope with a large number of variables. Applied to tumor marker expressions, the RF dissimilarity detects tumor sample clusters, and biologically meaningful clusters can often be described with simple thresholding rules.

Abstract

A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice.

An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl1 RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables.

We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
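The construction described above can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: that the Addcl1 reference distribution samples each variable independently from its empirical marginal (which preserves the marginals but removes dependence between variables), and that the dissimilarity is taken as sqrt(1 - proximity), where the proximity of two observations is the fraction of trees in which they share a terminal node. All function names are hypothetical, and the forest itself is not implemented here; the leaf assignments are assumed to come from whichever RF implementation is in use.

```python
import math
import random


def addcl1_synthetic(observed, rng):
    """Draw as many synthetic rows as there are observed rows.

    Assumption: each column is sampled independently from its own
    empirical marginal, so dependence between columns is destroyed.
    """
    columns = list(zip(*observed))
    return [tuple(rng.choice(col) for col in columns) for _ in observed]


def labeled_training_set(observed, rng):
    """Stack observed rows (label 1) on top of synthetic rows (label 2)."""
    synthetic = addcl1_synthetic(observed, rng)
    rows = [tuple(row) for row in observed] + synthetic
    labels = [1] * len(observed) + [2] * len(synthetic)
    return rows, labels


def rf_dissimilarity(leaf_ids):
    """Convert terminal-node assignments into a dissimilarity matrix.

    leaf_ids[i][t] is the terminal node reached by observation i in tree t.
    Assumption: dissimilarity = sqrt(1 - proximity), with proximity the
    fraction of trees in which two observations share a terminal node.
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
            dissim[i][j] = dissim[j][i] = math.sqrt(1.0 - shared / n_trees)
    return dissim
```

In use, a forest would be trained on the output of `labeled_training_set`, only the observed rows would be passed back through it to obtain `leaf_ids`, and the resulting matrix would feed a clustering or scaling procedure. With scikit-learn, for instance, `RandomForestClassifier.apply` returns leaf indices in the shape this sketch expects. Because the rows are plain tuples, numeric and categorical columns can be mixed freely at the sampling step.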
