Machine learning made easy

Go non-linear with Vowpal Wabbit

Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are:

  • a neural network with a single hidden layer
  • automatic creation of polynomial, specifically quadratic and cubic, features
  • N-grams

We describe how to use them, with examples from the Kaggle Amazon competition and the kin8nm dataset.

Neural network

The original motivation for creating neural network code in VW was to win some Kaggle competitions using only vee-dub, and that goal becomes much more feasible once you have a strong non-linear learner.

The network seems to be a classic multi-layer perceptron with one sigmoidal hidden layer. More interestingly, it has dropout. Unfortunately, in a few tries we haven’t had much luck with the dropout.

Here’s an example of how to create a network with 10 hidden units:

vw -d data.vw --nn 10
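To make the architecture concrete, here is a minimal sketch of the forward pass of such a network. It is not VW's code: tanh stands in for the sigmoidal activation, and the input size, the random weights and the function name mlp_predict are all assumptions made for illustration.

```python
import math
import random

def mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> one sigmoidal hidden layer -> linear output."""
    hidden = [
        # tanh is an assumed choice of sigmoidal activation
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias

# Illustrative sizes: 8 inputs and 10 hidden units, as in --nn 10.
rng = random.Random(0)
n_inputs, n_hidden = 8, 10
W = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b = [0.0] * n_hidden
v = [rng.uniform(-1, 1) for _ in range(n_hidden)]

x = [0.5] * n_inputs
print(mlp_predict(x, W, b, v, 0.0))
```

With 10 hidden units the model has 10 weight vectors over the inputs plus 10 output weights, which is where the extra non-linear capacity comes from.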

Quadratic and cubic features

The idea of quadratic features is to create all possible combinations between original features, so that instead of d features you end up with d^2 features, or d^3 in cubic mode.
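A small sketch of the counting argument, assuming we simply multiply feature values for every ordered combination. The helper interact and the feature names are hypothetical, and VW's internal bookkeeping may differ (for example in how it treats duplicate pairs).

```python
from itertools import product

def interact(features, order=2):
    """Return products for all ordered combinations of `order` features."""
    names = sorted(features)
    crossed = {}
    for combo in product(names, repeat=order):
        value = 1.0
        for name in combo:
            value *= features[name]
        crossed["*".join(combo)] = value
    return crossed

example = {"x1": 2.0, "x2": 3.0, "x3": 0.5}  # d = 3 original features
quadratic = interact(example, order=2)        # d^2 = 9 combinations
cubic = interact(example, order=3)            # d^3 = 27 combinations
print(len(quadratic), len(cubic))             # 9 27
```

Even with a modest d the feature count grows quickly: 1,000 original features already give a million quadratic ones.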

This poses a danger of overfitting if you have many features to start with, in which case --l1 or --l2 regularization might help. To reduce feature collisions, increase -b.

A word of explanation about feature hashing and hash collisions: VW hashes feature names into a 2^b-dimensional space. By default it uses 18 bits, so that’s about 262k possible features (2^18 = 262,144). If you have more than that, some of them are guaranteed to collide, meaning that the software won’t be able to distinguish between them, and collisions become likely well before you reach that limit. Fortunately you can increase the number of bits used for hashing so that you can get millions of features.
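The effect of the number of bits can be illustrated with a toy experiment. This is only a sketch: CRC32 from Python's zlib stands in for VW's hash function (an assumption, VW uses a different hash), and the feature names are made up.

```python
import zlib

def bucket(feature_name, bits):
    """Map a feature name to one of 2**bits slots (CRC32 is a stand-in hash)."""
    return zlib.crc32(feature_name.encode("utf-8")) & ((1 << bits) - 1)

def count_collisions(feature_names, bits):
    """Number of features that share a slot with an earlier feature."""
    used = {bucket(name, bits) for name in feature_names}
    return len(feature_names) - len(used)

names = [f"feature_{i}" for i in range(300_000)]  # hypothetical feature names
print(1 << 18)                      # 262144 slots with the default 18 bits
print(count_collisions(names, 18))  # at least 300000 - 262144 collisions
print(count_collisions(names, 25))  # far fewer with 25 bits
```

With 300,000 names and 262,144 slots, at least 37,856 features must share a slot regardless of the hash function used.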

With polynomial features you need to supply a namespace. This allows for combining features from selected namespaces. Read more about this in the -q section of VW’s command line arguments and in the description of the input format. The quick version is that you can have just one namespace and combine it with itself.

In a basic scenario with one namespace called “a”, you could create quadratic features like this:

vw -d data.vw -q aa

Cubic features take three namespaces, here the same one three times:

vw -d data.vw --cubic aaa
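Both commands assume that the features in data.vw sit in a namespace named “a”. Here is a minimal sketch of writing such lines in VW's input format, where the namespace name follows the pipe directly. The helper to_vw_line, the feature names and the values are hypothetical.

```python
def to_vw_line(label, features, namespace="a"):
    """Format one example as '<label> |<namespace> name:value ...'."""
    pairs = " ".join(f"{name}:{value}" for name, value in features.items())
    return f"{label} |{namespace} {pairs}"

# Two made-up examples with labels 1 and -1.
rows = [
    (1, {"x1": 0.5, "x2": 2.0}),
    (-1, {"x1": 1.5, "x2": 0.25}),
]
lines = [to_vw_line(label, feats) for label, feats in rows]
print("\n".join(lines))
# 1 |a x1:0.5 x2:2.0
# -1 |a x1:1.5 x2:0.25
```

The -q and --cubic options match namespaces by their first letter, so a single-letter name keeps things unambiguous.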

Polynomial features can be combined with a neural network or used separately.


N-grams

N-grams are contiguous sequences of n items from a given sequence. They are useful for modelling text beyond a bag of words. You can feed pretty raw text to Vowpal Wabbit, for example: “I won’t use random forest, I promise”. Then 2-grams, or bigrams, would be: “I won’t”, “won’t use”, “use random” and so on.

vw -d data.vw --ngram 2
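For illustration, this is roughly what bigram generation looks like on the sentence above. It is a sketch of the concept with naive whitespace tokenization (an assumption), not of how VW builds n-grams internally.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "I won’t use random forest, I promise"
bigrams = ngrams(text.split(), 2)
for gram in bigrams:
    print(gram)
```

A sentence of seven tokens yields six bigrams, and each one becomes an extra feature alongside the single words.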

Amazon Example

In the previous article we talked about the Amazon access control challenge at Kaggle. The score we presented there is no longer competitive. Here’s a way to beef it up a notch, to get AUC around 0.895:

vw -d data/train.vw -k -c -f data/model --loss_function logistic -b 25 --passes 20 -q ee --l2 0.0000005

The changes are:

  • add quadratic features with -q ee
  • increase the number of passes to 20, a number we found experimentally to work pretty well
  • add some L2 regularization to avoid or at least reduce overfitting. Again, the number was found experimentally.
  • use 25 bits for hashing (or more, if you can) to reduce feature hash collisions. Many features make collisions more likely.
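If you want to check AUC locally rather than on the leaderboard, one option is to hold out part of the training set and score the predictions yourself. The sketch below assumes such a validation split exists and uses made-up labels and scores. It counts correctly ordered pairs directly, which is quadratic in the number of examples but easy to verify.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [-1, -1, 1, 1]          # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predictions
print(auc(labels, scores))       # 0.75
```

AUC depends only on the ordering of the scores, so it does not matter whether the predictions are raw scores or probabilities.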

By the way, since last time a few people have posted their solutions in the competition forum. Some scores are similar to ours, some are better.


kin8nm

This set is a highly non-linear, medium-noise version of simulated robot arm kinematics data. We’ve used it before, and now we employ the same custom 80/10/10 split among 8192 examples. The files are available at GitHub. See commands.txt for invocations and scores.
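An 80/10/10 split of 8192 examples could be produced along these lines. The seed, the rounding and the helper name split_indices are assumptions, so the resulting sets will not match the files in the repository.

```python
import random

def split_indices(n_examples, seed=0):
    """Shuffle indices and cut them into train/validation/test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = n_examples * 8 // 10   # 80%, rounded down
    n_valid = n_examples // 10       # 10%, rounded down
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever is left, about 10%
    return train, valid, test

train, valid, test = split_indices(8192)
print(len(train), len(valid), len(test))  # 6553 819 820
```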

[Image: a robot with a broken arm. Image credit: Twitter]

kin8nm is quite a popular benchmark set for Gaussian Process regression, which is generally state of the art but unfortunately not very scalable. With GP, you can get an RMSE close to 0.05. Random forest might give you 0.14. With k-means mapping and a linear model, we got 0.09.
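For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal sketch with made-up numbers:

```python
import math

def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))

# Errors of 3 and 4 give sqrt((9 + 16) / 2), about 3.54.
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Lower is better, and the value is in the same units as the target.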

How does VW fare? After some tweaking it seems that you can get around 0.094 with VW’s --nn. When you add second order polynomial features with -q, you can go below 0.08. That seems pretty good to us, although not quite in the “win a competition” zone.