FastML

Machine learning made easy

Intro to random forests

Let’s step back from forays into cutting edge topics and look at a random forest, one of the most popular machine learning techniques today. Why is it so attractive?

First of all, decision tree ensembles were found by Caruana et al. to be the best overall approach across a variety of problems. Random forests, specifically, perform well on both low-dimensional and high-dimensional tasks.

There are basically two kinds of tree ensembles: bagged trees and boosted trees. Bagging means that when building each subsequent tree, we don’t look at the earlier trees, while in boosting we consider the earlier trees and strive to compensate for their weaknesses (which may lead to overfitting).

Random forests are an example of the bagging approach, which is less prone to overfitting. Gradient boosted trees (notably the GBM package in R) represent the boosting approach. Both are very successful in many applications. Trees are also relatively fast to train compared to some more involved methods.
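To make the bagging/boosting distinction concrete, here is a minimal sketch in Python using scikit-learn (an assumption on our part; the post itself only names the R GBM package). Both ensembles below use 100 trees on a synthetic dataset, but the forest grows its trees independently on bootstrap samples, while the boosted model grows each tree to correct the errors of the ones before it:

```python
# Illustrative comparison of bagged vs. boosted trees with scikit-learn.
# Dataset and settings are arbitrary, chosen only for the demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree is grown independently on a bootstrap sample
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Boosting: each new tree fits the residual errors of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbm.fit(X_train, y_train)

print("random forest accuracy:", rf.score(X_test, y_test))
print("boosted trees accuracy:", gbm.score(X_test, y_test))
```

On most problems the two land in the same ballpark; which one wins depends on the data and, for boosting, on how carefully its extra knobs are tuned.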

Besides effectiveness and speed, random forests are easy to use:

  • There are few hyperparameters to tune, the number of trees being the most important. It’s not difficult to tune, as usually more is better, up to a certain point. With bigger datasets it’s almost a matter of how many trees you can afford computationally. Having few hyperparameters differentiates random forests from gradient boosted trees, which have more parameters to tweak.

  • You don’t need to shift and scale your data. Shifting means subtracting the mean so that values in each column are centered around zero; scaling means dividing by the standard deviation so that the magnitudes of all features are similar. Gradient descent based methods (for example neural networks and SVMs) benefit from such pre-processing. It’s not much work, but it’s still an additional step to perform.
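The scaling point follows from how trees work: a split compares a feature against a threshold, so only the ordering of values matters, and shifting or scaling a column leaves that ordering intact. A small sketch with scikit-learn (again our assumption of library; the dataset is synthetic) shows that a forest trained on raw data and one trained on standardized data agree on essentially every prediction:

```python
# Sketch: tree splits depend only on the ordering of feature values,
# so standardizing the columns should not change the learned forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_scaled = StandardScaler().fit_transform(X)  # subtract mean, divide by std

raw = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
scaled = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_scaled, y)

# Same random seed, same value ordering -> the predictions agree
agreement = np.mean(raw.predict(X) == scaled.predict(X_scaled))
print("fraction of matching predictions:", agreement)
```

Contrast this with a neural network or an SVM, where unscaled features with very different magnitudes can slow or derail training.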

One last thing we will mention is that a random forest generates an internal unbiased estimate of the generalization error as the forest building progresses [Breiman], known as out-of-bag error, or OOBE. It stems from the fact that any given tree only uses a subset of available data for training, and the rest can be used to estimate the error. Thus you can immediately get a rough idea of how the learning goes, even without a validation set.

All this makes random forests one of the first choices in supervised learning.
