FastML

Machine learning made easy

Tuning hyperparams fast with Hyperband

Hyperband is a relatively new method for tuning iterative algorithms. It performs random sampling and attempts to gain the edge by using time spent optimizing in the best way. We explain a few things that were not clear to us right away, and try the algorithm in practice.

Candidates for tuning with Hyperband include all the SGD derivatives - meaning the whole deep learning - and tree ensembles: gradient boosting, and perhaps to a lesser extent, random forest and extremely randomized trees. In other words, the most important supervised methods in use today.

The idea is to try a large number of random configurations:

while the Bayesian Methods perhaps consistently outperform random sampling, they do so only by a negligible amount. To quantify this idea, we compare to random run at twice the speed which beats the two Bayesian Optimization methods, i.e., running random search for twice as long yields superior results.


TPE in this chart is the Tree of Parzen Estimators from HyperOpt

Trying all these configurations takes time. If you ever tuned parameters by hand, you know that for some sets of params, you can tell right from the start that they won’t be good. Still, popular tools take it to the bitter end and run for a prescribed number of iterations to get a score.

To solve this problem, Hyperband runs configs for just an iteration or two at first, to get a taste of how they perform. Then it takes the best performers and runs them longer. Indeed, that’s all Hyperband does: run random configurations on a specific schedule of iterations per configuration, using earlier results to select candidates for longer runs.

See the table below for an example of such schedule (the default). It starts with 81 runs, one iteration each. Then the best 27 configurations get three iterations each. Then the best nine get nine, and so on. After all runs are complete, the algorithm returns a best configuration found so far and you can run it all over again.

max_iter = 81        s=4             s=3             s=2             s=1             s=0
eta = 3              n_i   r_i       n_i   r_i       n_i   r_i       n_i   r_i       n_i   r_i
B = 5*max_iter       ---------       ---------       ---------       ---------       ---------
                      81    1         27    3         9     9         6     27        5     81
                      27    3         9     9         3     27        2     81
                      9     9         3     27        1     81
                      3     27        1     81
                      1     81

The schedule depends on two main parameters, max_iter and eta. s is derived from these two and dictates the number of rounds. As you can see, the authors arranged things so that there are no floats, only integers - reportedly based on masonic numerology (you know, ordo ab chao). They don’t mention it overtly and that’s the way it should be - some things must be kept confidential (then again, some don’t). But you surely wouldn’t perform the level-III underhand “pull-tap-pull” handshake in front of the cameras, would you? Reckless!

ORDO AB CHAO

Should you choose other values for these params, there will be fractionals in the number of iterations, but it’s not a problem, really, because an iteration is not what you think it is.

The term iteration is meant to indicate a single unit of computation (e.g. an iteration could be .5 epochs over the dataset) and without loss of generality min_iter=1. Consequently, the length of an iteration should be chosen to be the minimum amount of computation where different hyperparameter configurations start to separate (or where it is clear that some settings diverge).

For tree-based methods, an iteration will be a number of trees, let’s say 5 or 10.

even if performance after a small number of iterations is very unrepresentative of the configurations absolute performance, its relative performance compared with many alternatives trained with the same number of iterations is roughly maintained.

By the way: in random forest, more trees serve to reduce variance, so this assumption may be slightly less valid than for other methods. With just a few trees, the difference between configurations might just reflect noise. With more trees, true performance reveals itself. This dynamic can possibly offset Hyperband’s hedging.

There are obvious counter-examples; for instance if learning-rate/step-size is a hyperparameter, smaller values will likely appear to perform worse for a small number of iterations but may outperform the pack after a large number of iterations.

Which leads us to the wickest point of the system: experiments involving tuning a learning rate.

Learning rate - to tune or not to tune?

Hyperband is not a silver bullet.

In practice, when learning rate is a parameter, Hyperband finds configurations that converge quickly. But by the same token, it’s unlikely to find good “low learning rate with many iterations” combos. If you follow developments on Kaggle, you know that people often run XGBoost with precisely this setup to get the best results.

Even though authors say they address this problem by hedging, in default setup there are only a few configurations running for max. iterations (the last round: 5 x 81). All others are pre-selected with a shorter number of iterations. We think that random search among five configurations is unlikely to hit the best stuff.

If you do not tune learning rate, the Hyperband algorithm makes good sense. On the other hand, one could get rid of the last round because hedging is not necessary. Good configs will be pre-selected earlier, no need for blind random search. So, if you cut out two of the five main loops, you save 40% of time, but only forfeit checking 13 configurations.

Even better, one could discard the last tier (1 x 81, 2 x 81, etc.) in each round, including the last round. This drastically reduces time needed. We provide this option in our code.

Implementation

The way it works, you give Hyperband two functions: one that returns a random configuration, and one that trains that configuration for a given number of iterations and returns a loss value. We call them get_params() and try_params() respectively.

To define a search space and sample from it, we use hyperopt - no sense in reinventing the wheel. Of course if you don’t like it, you’re free to implement get_params() in any way you choose.

Here’s what a space for GradientBoostingClassifier might look like:

space = {
    'learning_rate': hp.uniform( 'lr', 0.01, 0.2 ),
    'subsample': hp.uniform( 'ss', 0.8, 1.0 ),
    'max_depth': hp.quniform( 'md', 2, 10, 1 ),
    'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None )),
    'min_samples_leaf': hp.quniform( 'mss', 1, 10, 1 ),
    'min_samples_split': hp.quniform( 'mss', 2, 20, 1 )
}

The hyperparams are straight from the manual. The distributions (hp.uniform, hp.quniform, hp.choice etc.) are described in detail in the Hyperopt wiki. In short:

'learning_rate': hp.uniform( 'lr', 0.01, 0.2 )

Learning rate is to be sampled from a uniform distribution. The first argument, lr, is a label, which, frankly, we don’t care about much. Apparently Hyperopt needs them. After a label we say that learning rate can vary from 0.01 to 0.2.

'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None ))

Max. features is a categorical variable, and the possible values are ‘sqrt’, ‘log2’, or None. In this example None stands for “no maximum”, meaning all features.

'max_depth': hp.quniform( 'md', 2, 10, 1 )

Some variables, like the number of trees, or max. depth of a single tree, are integers, not floats. Therefore we use hp.quniform (quantized uniform) with parameter q (the last one) = 1. Should we need values like 4, 8, 12, 16, we’d use q = 4.

There are other handy distributions, specifically log uniform, but these three are the most important.

From our experiments on one dataset, using all features (max_features = None) emerged as a winner. It also looks like increasing min_samples_leaf and min_samples_split, the params to curb overfitting, might help. They interact with max_depth, which works in the other direction. The point is, you don’t need to discover all this by hand - let the computer do the work.

Here’s our parameter space for random forest and/or extremely randomized trees:

space = {
    'criterion': hp.choice( 'c', ( 'gini', 'entropy' )),
    'bootstrap': hp.choice( 'b', ( True, False )),
    'class_weight': hp.choice( 'cw', ( 'balanced', 'balanced_subsample', None )),
    'max_depth': hp.quniform( 'md', 2, 10, 1 ),
    'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None )),
    'min_samples_split': hp.quniform( 'msp', 2, 20, 1 ),
    'min_samples_leaf': hp.quniform( 'msl', 1, 10, 1 ),
}

That was the hard bit, the rest is really easy. The authors kindly provide source code, which is just a snippet of Python. We build on this piece to provide a fully functional implementation, which you can find at GitHub.

P.S.

“Band” in the name stands for “bandit”.

Comments