Unsupervised random forest: a tutorial with case studies

Abstract

Multidimensional data exploration often begins with some form of dimensionality reduction, of which principal component analysis is the most commonly used. In its traditional implementation, this approach can capture only linear relations, which can hamper the analyst's ability to detect important non-linear structure in the data. In this tutorial, we present a little-known yet powerful alternative, Unsupervised Random Forest (URF). URF rests on a simple assumption: if the data being modelled hold any structure, they should be distinguishable from a randomly generated dataset. URF makes no distributional assumptions, handles both continuous and categorical attributes, and does not require scaling. Like its parent method, Random Forest, it can model both linear and non-linear relationships. A further advantage of URF is the small number of parameters to optimize. Low-dimensional visualisation of the proximity matrix allows the user to discover patterns and clustering in the data. This tutorial describes not only the underlying theory but also the practical inner workings of URF. Two real data sets demonstrate the potential of URF and provide a basic framework for comparing its performance with that of other exploratory methods. Further research opportunities are also presented. The corresponding code in R and MATLAB is available.
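The two ingredients named above, a randomly generated reference dataset and a proximity matrix, can be illustrated with a minimal sketch. It is not the R or MATLAB code that accompanies the tutorial. The function names `make_urf_training_set` and `proximity` are hypothetical, and permuting each column independently is assumed here as one common way of generating the reference data; the tutorial may use a different scheme. The forest itself is omitted: any Random Forest classifier could be trained on the labelled samples, and the leaf reached by each sample in each tree would then feed the proximity calculation.

```python
import random


def make_urf_training_set(rows, seed=0):
    """Build a two-class problem from unlabelled data (illustrative helper).

    Real rows get label 1. A synthetic copy, in which every column is
    permuted independently, gets label 0. Permutation keeps each
    variable's marginal distribution but breaks the dependence
    between variables.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    synthetic = [list(r) for r in zip(*columns)]
    samples = [list(r) for r in rows] + synthetic
    labels = [1] * len(rows) + [0] * len(synthetic)
    return samples, labels


def proximity(leaf_ids):
    """Proximity matrix from leaf assignments (illustrative helper).

    leaf_ids[t][i] is the leaf reached by sample i in tree t.
    Entry (i, j) is the fraction of trees in which samples i and j
    end up in the same leaf.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for i in range(n):
            for j in range(n):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Mixed continuous and categorical attributes need no scaling or encoding here.
data = [[1.0, "a"], [2.0, "b"], [3.0, "c"], [4.0, "d"]]
samples, labels = make_urf_training_set(data)

# Hand-written leaf assignments for three samples in two trees.
prox = proximity([[0, 0, 1], [0, 1, 1]])
```

In this sketch the dissimilarity between two samples could be taken as one minus their proximity, and it is a matrix of that kind that a low-dimensional visualisation would then project.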