The Criteo competition is about ad click prediction. The unpacked training set is 11 GB and has 45 million examples. While we’re not sure if it qualifies as the mythical big data, it’s quite big by Kaggle standards.
Unless you have an adequate machine, it will be difficult to process it in memory. Our solution is to use online or mini-batch learning, which deals with one example, or a small batch of examples, at a time. Vowpal Wabbit is especially well suited for this contest, for a number of reasons.
Minimizing log loss
The metric for the competition is logarithmic loss, also known as logistic loss or cross-entropy:
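$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\right]$$

where $y_i \in \{0, 1\}$ is the true label, $p_i$ the predicted probability of a click, and $N$ the number of examples.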
The use of log on the error provides extreme punishments for being both confident and wrong. In the worst possible case, a single prediction that something is definitely true when it is actually false will add infinity to your error score and make every other entry pointless.
That’s because log(0) is minus infinity.
It so happens that VW can minimize this loss function directly, so it won’t produce overly certain predictions:
vw --loss_function logistic ...
With tools that minimize other loss functions, however, you might want to cap the estimates. In fact, even VW predictions seem to benefit slightly from capping at 0.98 (or less) at the top and 0.02 at the bottom.
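To see what capping does to the metric, here’s a minimal sketch in Python with made-up numbers; the clip bounds are the 0.02 / 0.98 mentioned above:

import numpy as np

def logloss(y, p):
    # average cross-entropy between labels y in {0, 1} and predicted probabilities p
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])

p_ok = np.array([0.9, 0.2, 0.8, 0.1])
print(logloss(y, p_ok))                       # modest loss, about 0.16

p_sure = np.array([0.9, 0.2, 0.0, 1.0])       # dead certain and dead wrong on the last two
print(logloss(y, p_sure))                     # inf (numpy warns about log(0))

p_capped = np.clip(p_sure, 0.02, 0.98)        # cap at 0.02 and 0.98
print(logloss(y, p_capped))                   # large but finite, about 2.04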
Lastly, since the loss function is the same, validation error estimates from VW translate closely to Kaggle scores.
Holdout test for validation and early stopping
Vowpal Wabbit recently gained a handy feature: a built-in holdout set for validation. If you’re doing one pass, it doesn’t matter, because progressive validation is enough. Progressive validation means that the software computes the error on each data point before using that point for learning. For multiple passes this amounts to computing the training error, which is not good. Hence the holdout set: VW sets aside 10% of the examples just for validation, keeping the error estimates honest.
This enables early stopping: after each pass VW checks the validation error and, if it goes up, stops training. Therefore you don’t need to worry about specifying the number of passes exactly, just give an upper limit.
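A multi-pass run with the holdout mechanism on might look something like this (the file names here are just placeholders; the cache created by -c is required for more than one pass):

vw -d train.vw -c --passes 20 --loss_function logistic -f model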
When you know how many passes are optimal, you can disable the holdout (--holdout_off) and train on the full set.
Stop and resume training
There is another, more complicated way to implement early stopping: train for one pass (this would be called an epoch in neural network parlance), save the model, and get a validation score. Then resume training, save the model again, and so on.
For the first pass:
vw --save_resume -f model ...
It means: save the model, along with some extra information needed for resuming, to a file called model. For the next passes, you’d load the saved model with -i to resume training:
vw --save_resume -f model -i model ...
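To get the validation score for a saved model between passes, you can run it over a held-out file in test-only mode; again, the file names are placeholders:

vw -t -d valid.vw -i model --loss_function logistic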
Handling categorical features directly
Some of the features are numeric and some are categorical. More often than not learners accept only numeric features, which means you need to encode categorical variables. With VW you can use them directly. The proper way to go about it is to either create a namespace for each categorical column:
|source Berlin |destination Addis_Ababa
or include the column name (possibly shortened) in the feature name, as a prefix:
source_Berlin destination_Addis_Ababa
The reason is that if you have the same values in different columns, like source and destination, they will get mixed up if thrown into the same bag:
Berlin Addis_Ababa
Is Berlin a source or a destination?
In this competition, however, this simpler way, as introduced by Triskelion, seems to work better for some reason.
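To make the formats above concrete, here is a minimal sketch of turning one line of the tab-separated training file into a VW line with one namespace per categorical column. It is illustrative only, not Triskelion’s actual script, and it assumes the Criteo layout of a label followed by 13 integer columns and 26 categorical columns; dropping the namespace markers gives the plainer encoding.

def to_vw_line(row):
    # row: list of strings from one tab-separated line of train.txt
    label = '1' if row[0] == '1' else '-1'    # VW's logistic loss expects -1/1 labels
    ints = ' '.join('i%d:%s' % (j, v) for j, v in enumerate(row[1:14], 1) if v)
    cats = ' '.join('|c%d %s' % (j, v) for j, v in enumerate(row[14:40], 1) if v)
    return '%s |i %s %s' % (label, ints, cats)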
Printing progress
There is a -P option to control how often VW prints progress info. Normally it backs off exponentially, that is, it prints info very often at the beginning and then less and less often. The default schedule uses powers of two: 1, 2, 4, 8 and so on. This is equivalent to -P 2.0. Notice the float; if you use an integer, VW will print progress every x examples, for example every million with -P 1e6.
Set -b as high as memory allows
Vowacious Wabbit has many hyperparameters, some more important than others. Usually the defaults work pretty well. The surest way to instantly get a better score is to increase the size of the space that feature names are hashed into.
It’s important to have a big hashing space when there are many features in the data; otherwise some names will hash to the same value and the software won’t be able to tell one feature from another. This is called a hash collision.
The -b option controls the number of bits in the hash, so the feature table has 2^b slots; by default b is 18. 2^18 is just 262144, so if there are more features than that, you are guaranteed to get collisions.
How high one can go depends on how much RAM one has. Apparently with 12 GB, for example, you can use -b 29, but with 8 GB only -b 28.
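A rough back-of-the-envelope for the weight table alone, assuming 4 bytes per weight and ignoring the extra per-weight state that some update rules keep:

for b in (18, 24, 28, 29):
    size_mb = 2 ** b * 4 / 1e6    # 2^b weights, 4 bytes each
    print('-b %d: about %.0f MB' % (b, size_mb))

The real footprint is higher because the default adaptive and normalized updates keep extra state per weight, so treat these numbers as a lower bound.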
As you can see, all this is quite straightforward. The only tricky step is converting the data to VW format. Triskelion’s code for that is available on GitHub.