What do you get when you take backpropagation out of a multilayer perceptron? You get an extreme learning machine, a non-linear model with the speed of a linear one.

ELM is a Chinese invention. Imagine a classic feed-forward neural network with one hidden layer, subtract backpropagation and you have an ELM. The input-hidden weights are constant - they are apparently initialized analytically so even though they are semi-random the thing works. The model learns only the hidden-output weights, which amounts to learning a linear model. Hence the speed.
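To make the idea concrete, here is a minimal NumPy sketch of an ELM regressor. It is our own illustration, not code from any particular library: the names `fit_elm` and `predict_elm`, the plain Gaussian initialization, the tanh activation and the toy sine data are all assumptions made for the example.

```python
import numpy as np

def fit_elm(X, y, n_hidden=50, seed=0):
    """Fit a toy ELM: fixed random hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    # Input-hidden weights and biases are drawn once and never updated
    # (plain Gaussian draws here, which is an assumption of this sketch).
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer activations
    # Only the hidden-output weights are learned, by ordinary least squares.
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Push inputs through the fixed hidden layer, then the linear output."""
    return np.tanh(X @ W + b) @ beta

# Toy data: a noisy sine wave
rng = np.random.default_rng(1)
x_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.05 * rng.normal(size=200)

W, b, beta = fit_elm(x_train, y_train)
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)
p = predict_elm(x_test, W, b, beta)
```

There is no iterative training loop anywhere: fitting comes down to one matrix product and one least-squares solve, which is where the speed comes from.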

A North-Korean tractor simulator. It’s an extreme learning machine too.

Actually the perceptron model is only half the solution, at least in David Lambert’s Python-ELM, the software we’ll be using. The other half is a radial basis function network (see The Secret of The Big Guys) based on clustering and distance measures. Normally Python-ELM employs both parts and there’s a mixing parameter *alpha* controlling contribution from one vs the other, but you can use just one if you want to.
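Here is a rough sketch of how such a mix could work. This is only our illustration of the idea and not necessarily the formula Python-ELM uses: the function name `mixed_hidden_layer`, the convention that *alpha* weights the perceptron part, and the way `rbf_width` scales the distances are all assumptions.

```python
import numpy as np

def mixed_hidden_layer(X, W, b, centers, alpha=0.5, rbf_width=1.0):
    """Blend perceptron-style and RBF-style inputs to the hidden units."""
    # Perceptron part: dot product of the inputs with random weights, plus bias.
    mlp_acts = X @ W + b
    # RBF part: Euclidean distance from every sample to every center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    rbf_acts = rbf_width * dists
    # Assumed convention: alpha = 1 keeps only the perceptron part,
    # alpha = 0 keeps only the RBF part.
    return np.tanh(alpha * mlp_acts + (1.0 - alpha) * rbf_acts)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features
W = rng.normal(size=(3, 4))        # weights for 4 hidden units
b = rng.normal(size=4)
centers = rng.normal(size=(4, 3))  # one center per hidden unit

H = mixed_hidden_layer(X, W, b, centers, alpha=0.5)
```

Setting `alpha` to one of the extremes switches the other half off, which is what using just one part means in this picture.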

Python-ELM features quite a few activation functions, including standard sigmoid and tanh. We have added rectified linear activation, that is *max( 0, x )*, for good measure.
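For reference, rectified linear activation takes one line of NumPy. The function name `relu` is ours, and this is a sketch rather than the exact code we added to the library.

```python
import numpy as np

def relu(x):
    """Rectified linear activation: element-wise max(0, x)."""
    return np.maximum(0, x)

# Negative inputs are clipped to zero, positive ones pass through unchanged.
acts = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
```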

There is also a version called GRBF, which according to the paper authors gives strictly better results on the datasets they tested. With our data it works worse and runs slower. The upside is that it has just two hyperparams: the number of hidden units and another parameter, which the paper authors say is best set to 0.05 (*MELM-GRBF: a modified version of the extreme learning machine for generalized radial basis function neural networks* [PDF]).

ELM consists of a hidden layer and a linear output layer. Python-ELM is a complete solution, but instead of standard least squares you can use the `Pipeline` from scikit-learn to put a regularized linear model (or in fact any regressor/classifier) on top of the hidden layer. You can even stack two or more hidden layers, although we’re not sure if it makes sense theoretically. Here’s a basic regression example:

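```
# module names as in the Python-ELM repository
from elm import GenELMRegressor
from random_layer import RandomLayer

# hidden layer with fixed weights; hyperparams are assumed to be set already
rl = RandomLayer( n_hidden = n_hidden, alpha = alpha,
    rbf_width = rbf_width, activation_func = activation_func )
# ELM regressor: least squares on top of the hidden layer
elmr = GenELMRegressor( hidden_layer = rl )
elmr.fit( x_train, y_train )
p = elmr.predict( x_test )
```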

The same thing, but with Ridge on top:

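```
from sklearn import pipeline
from sklearn.linear_model import Ridge
from random_layer import RandomLayer

rl = RandomLayer( n_hidden = n_hidden, alpha = alpha,
    rbf_width = rbf_width, activation_func = activation_func )
# a regularized linear model replaces plain least squares
ridge = Ridge( alpha = ridge_alpha )
elmr = pipeline.Pipeline( [( 'rl', rl ), ( 'ridge', ridge )] )
elmr.fit( x_train, y_train )
p = elmr.predict( x_test )
```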

There are more examples, both for classification and regression, in elm_notebook.py.

If you’re further interested in the matter, check out a Reddit thread on ELM and this recently published article on Extreme Learning Machines with Julia. Unfortunately the author uses a rather trivial dataset, both in size and difficulty.

In case you’re wondering how to choose `n_hidden`, `alpha`, `rbf_width` and so on, the next article is about optimizing hyperparams.