Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics

Abstract

The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

© 2012 Wiley Periodicals, Inc.

This article is categorized under: Algorithmic Development > Hierarchies and Trees; Algorithmic Development > Statistics; Application Areas > Health Care
