Our Indiegogo campaign turned out to be a (partial) success, so we deliver as promised: a comparison of Vowpal Wabbit, Liblinear/SBM and StreamSVM on the webspam dataset. Refer to Comparing large-scale linear learners for motivation and references.
Just to recap: the Liblinear/SBM and StreamSVM papers deal with linear learners able to handle data which doesn’t fit into memory. Among other things, the authors experiment with various tools and find VW not quite competitive. We set out to clarify the situation using current versions of the software involved.
Data
The authors of the Liblinear/SBM paper run Vowpal Wabbit on two datasets, which narrows down the selection as far as reproducing the results goes. These two sets are webspam and kdd10. We have chosen the first because it’s bigger: 24GB vs 5GB. The downside of webspam is that it’s easy to get 99% accuracy, while kdd10 is more demanding, with accuracy in the 85-90% range.
Webspam has 350k rows and 16.6M features. It is available from the LibSVM dataset repository and was originally used in the Pascal Large Scale Learning Challenge. The underlying data consists of web pages’ HTML source, classified as spam or not. We’re dealing with the trigram version; there’s also a unigram version with a much smaller number of features. We split the data randomly into training and test sets, roughly 280k / 70k examples, following the 80/20 split used in the papers.
python split.py train.txt train_v.txt test_v.txt 0.8 dat_seed
split.py is a phraug script: it randomly splits the given file in two, sending 80% of the lines to the first file and the rest to the second. We use a fixed random seed so that you can get exactly the same files.
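For reference, the core logic of such a split amounts to the sketch below; the actual phraug script may differ in details like argument handling.

import random
import sys

# usage: python split.py input_file output_file_1 output_file_2 p seed
# each line goes to the first output with probability p
input_file, out_1, out_2 = sys.argv[1], sys.argv[2], sys.argv[3]
p, seed = float(sys.argv[4]), sys.argv[5]
random.seed(seed)

with open(input_file) as i, open(out_1, 'w') as o1, open(out_2, 'w') as o2:
    for line in i:
        (o1 if random.random() < p else o2).write(line)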
Vowpal Wabbit
Since the data is in libsvm format, we need to convert it for VW. The format in this case differs from libsvm only by something like “|x” between the label and the rest of the line. To speed things up, let’s skip the text file and save the data as VW’s binary cache file. Instead of saving the converted output to a file, we send it to standard output (Unix territory) and pipe it directly to VW, which trains and creates a cache file, the side effect we’re after.
python libsvm2vw.py train_v.txt /dev/stdout | vw --cache_file train_v.cache
python libsvm2vw.py test_v.txt /dev/stdout | vw --cache_file test_v.cache
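The conversion itself is simple. Here’s a hedged sketch of what libsvm2vw.py boils down to; the real script may differ in details such as label mapping.

import sys

# libsvm line: "label index:value index:value ..."
# VW line:     "label |x index:value index:value ..."
for line in open(sys.argv[1]):
    label, _, features = line.rstrip('\n').partition(' ')
    print('{} |x {}'.format(label, features))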
UPDATE: the contents of the cache depend on the number of bits used for the hash table (-b). You can use an existing cache to train with fewer bits, but not vice versa, so it makes sense to create the cache with more bits in the first place:
python libsvm2vw.py train_v.txt /dev/stdout | vw -b 29 --cache_file train_v.cache
python libsvm2vw.py test_v.txt /dev/stdout | vw -b 29 --cache_file test_v.cache
And now we find the optimal number of passes using VW’s built-in validation/early-stopping mode:
vw --cache_file train_v.cache --loss_function logistic -b 29 -P 10000 --passes 10
(...)
finished run
number of examples per pass = 252163
passes used = 5
weighted example sum = 1.26082e+06
weighted label sum = 267743
average loss = 0.0341765 h
best constant = 0.212357
total feature number = 4703937821
Then we disable it (--holdout_off) to train on the full file:
$ time vw --cache_file train_v.cache --loss_function logistic -b 29 -P 10000 --passes 5 --holdout_off -f model_v_logistic
(...)
finished run
number of examples per pass = 280130
passes used = 5
weighted example sum = 1.40065e+06
weighted label sum = 298340
average loss = 0.0178536
best constant = 0.213001
total feature number = 5225004535
real 4m10.574s
user 2m55.532s
sys 0m14.009s
See Holdout test for validation and early stopping for an explanation of this procedure.
To get predictions for the test set, run VW in test mode with the saved model. Once we have predictions, we can score them using the code available at GitHub.
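A hedged sketch of both steps, assuming standard VW flags (-t for test-only mode, -i to load the model, -p to write predictions); the scoring snippet is an illustrative scikit-learn stand-in for that code, and test_labels.txt, a file with the true labels, is hypothetical.

vw -t --cache_file test_v.cache -i model_v_logistic -p p_v_logistic.txt

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.loadtxt('test_labels.txt')  # hypothetical file with true -1/1 labels
p = np.loadtxt('p_v_logistic.txt')      # raw VW margins, one per line

# threshold raw margins at zero for accuracy, use them directly for AUC
print('accuracy:', accuracy_score(y_true, np.where(p > 0, 1, -1)))
print('AUC:', roc_auc_score(y_true, p))

The logistic model scores: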
accuracy: 0.994675826535
AUC: 0.998418419401
We’ll be comparing classifiers’ accuracy, so another option is to use hinge loss instead of logistic (--loss_function hinge). The procedure is the same; the scores are slightly worse:
accuracy: 0.993502218406
AUC: 0.99632599445
Regular Liblinear
For starters, let’s just try a tool that’s not designed to handle data bigger than memory. When you run regular Liblinear’s train on webspam, nothing happens for a long time. System monitoring tools tell the story: iotop shows intensive disk reads (60-90 MB/s, probably the maximum for this system), while CPU sits at 20-40% of one core. After a while CPU goes to roughly full utilization of one core and memory usage starts to climb; that’s probably a second pass over the data, this time loading it. Then CPU utilization becomes irregular, swap usage climbs, and the system turns intermittently unresponsive. All this time train hasn’t printed a word. It’s time to kill the process and try Liblinear/SBM.
Liblinear/SBM
We downloaded and built the software according to the instructions in the README, no rocket science here. The software is called liblinear-cdblock and is a fork of Liblinear, so if you need other non-standard extensions you’re out of luck. Instead of the original’s train program you get blocktrain, plus blocksplit.
The first thing to do is to blocksplit the training file into a number of blocks, each of which should fit into memory. The default number of blocks is eight; with 12GB of memory and webspam you can get away with just two.
The process takes one core to roughly 100%, uses practically no memory and creates a folder with blocks in the current directory. The folder is named after the input file, with a dot and the number of blocks appended, for example train_v.txt.2.
$ time ./blocksplit -m 2 data/train_v.txt
............................
time : 822
real 13m42.357s
user 12m8.597s
sys 0m10.435s
The software outputs a series of dots so at least you know something is going on.
Now it’s time to run blocktrain. We use the default settings. CPU gets maxed out and memory usage grows. After a while the software spits out a line with the first iteration’s stats, including cumulative run time, the value of the objective function and the gap (whatever that is). The default setting for max iterations is five, and the first iteration takes roughly two minutes on our hardware. The next iterations are progressively quicker. The objective stays the same after the second iteration; only the gap gets smaller.
In the middle of each iteration, CPU utilization falls to zero for a couple of seconds, in sync with the memory usage pattern.
$ time ./blocktrain data/train_v.txt.2/ data/cdblock_model
iter 1 time 118 runtime 118 loadtime 0 obj -9213.28 PGmax 0.015024 PGmin -0.01629 gap 0.031314
iter 2 time 228 runtime 228 loadtime 0 obj -9234.21 PGmax 0.013591 PGmin -0.011198 gap 0.02479
iter 3 time 315 runtime 315 loadtime 0 obj -9234.22 PGmax 0.009242 PGmin -0.027316 gap 0.036558
iter 4 time 402 runtime 402 loadtime 0 obj -9234.22 PGmax 0.0038895 PGmin -0.010379 gap 0.014268
iter 5 time 488 runtime 488 loadtime 0 obj -9234.23 PGmax 0.0017845 PGmin -0.0024688 gap 0.0042533
real 8m14.249s
user 5m17.476s
sys 0m18.643s
The predict program seems to support probability estimates, but only with logistic regression. Otherwise you get a bunch of ones and negative ones.
$ time ./predict data/test_v.txt data/cdblock_model data/cdblock_p.txt
Accuracy = 99.3245% (69398/69870)
real 1m5.930s
user 1m1.962s
sys 0m1.584s
Generally everything works as advertised when it comes to handling large files. The metrics are slightly, but strictly, worse than VW’s (either logistic or hinge). We fall just a bit short of the accuracy reported in the paper, 0.995, probably because we used the default hyperparams.
accuracy: 0.993244597109
AUC: 0.993511427279
Probability estimates?
Now let’s try the logistic regression solver (-s 7), so that we can get probability estimates. It shows fewer stats and iterations take longer: the first 140 seconds, the last about 110 seconds.
$ time ./blocktrain -s 7 data/train_v.txt.2/ /path/to/model/file
iter 1 time 141 runtime 141 loadtime 0 obj -19147.418065
iter 2 time 282 runtime 282 loadtime 0 obj -19588.075924
iter 3 time 393 runtime 393 loadtime 0 obj -19605.536968
iter 4 time 507 runtime 507 loadtime 0 obj -19612.382242
iter 5 time 621 runtime 621 loadtime 0 obj -19613.193329
real 10m22.001s
user 7m11.244s
sys 0m18.864s
Be careful to specify a valid path for a model file. Otherwise the program will run and won’t complain, but there will be no model.
$ time ./blocktrain -s 7 data/train_v.txt.2/ data/cdblock_model_s7
iter 1 time 140 runtime 140 loadtime 0 obj -19147.418065
iter 2 time 275 runtime 275 loadtime 0 obj -19588.075924
iter 3 time 383 runtime 383 loadtime 0 obj -19605.536968
iter 4 time 495 runtime 495 loadtime 0 obj -19612.382242
iter 5 time 609 runtime 609 loadtime 0 obj -19613.193329
real 10m15.373s
user 7m14.188s
sys 0m21.043s
Time to predict.
$ time ./predict data/test_v.txt data/cdblock_model_s7 data/cdblock_p_s7.txt
Accuracy = 98.5201% (68836/69870)
real 1m7.231s
user 1m2.190s
sys 0m1.569s
The results are significantly worse:
accuracy: 0.985201087734
AUC: 0.985763288671
Let’s tell Liblinear we want probabilities, using -b 1:
$ time ./predict -b 1 data/test_v.txt data/cdblock_model_s7 data/cdblock_p_s7_probs.txt
probability output is only supported for logistic regression
Apparently no probability estimates from Liblinear today.
Generally, Liblinear is no match for Vowpal Wabbit for us, at least on this dataset: it runs longer, offers worse feedback and scores slightly worse. Maybe the results could be improved by tuning the hyperparams: the authors report setting the most important parameter, C, to 64, while the default is 1. However, they also report 99.5% accuracy where we get 99.3%, so it seems there’s not that much to be gained here.
StreamSVM
Building StreamSVM is a bit more involved than building Liblinear: one needs a few dependencies. Luckily, apt-get install ... does the trick.
With this tool you don’t need to convert the data, which is a plus. However, this seems to come at a price: it takes much longer to train than either VW or Liblinear.
When calling the program, a small inconvenience greets us: you need to specify the number of training examples and the data dimensionality at the outset. Let’s count those lines then.
wc -l train_v.txt
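wc -l takes care of the example count. The dimensionality is the largest feature index in the file, which you can find with a throwaway script along these lines (slow on a 24GB file, but it’s a one-off):

import sys

# scan a libsvm-format file: count rows and track the max feature index
rows, max_index = 0, 0
for line in open(sys.argv[1]):
    rows += 1
    for token in line.split()[1:]:
        max_index = max(max_index, int(token.split(':')[0]))
print(rows, max_index)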
When you run StreamSVM, there’s some info at the start and then it goes silent for a while before the iteration info comes. The first iteration took about 280 seconds, more than twice as long as with Liblinear. The value of the objective function stays the same after the third iteration.
$ time ./StreamSVM -d 280130 -f 16609143 data/train_v.txt data/streamsvm_model
Warning! Setting max_nnz to 16609143
model parameter C is not specified. C=1 would be applied this time.
model_name is specified but the way to save is not specified.
the model will be saved at the end
model_name is data/streamsvm_model
Trainer id 0 Iteration 1 time 279.346566 obj 9848.694509 gap 6.150838
Cache access 13237417 epsilon 1.274152 active size 9675 access ratio 47.254550 access speed(M/sec) 0.047387
Trainer id 0 Iteration 2 time 297.081339 obj 10376.918432 gap 4.585239
Cache access 13909444 epsilon 0.100929 active size 8720 access ratio 49.653532 access speed(M/sec) 0.046820
Trainer id 0 Iteration 3 time 318.703953 obj 10380.234806 gap 0.400062
Cache access 13655784 epsilon 0.000269 active size 1606 access ratio 48.748024 access speed(M/sec) 0.042848
Trainer id 0 Iteration 4 time 302.305856 obj 10380.244612 gap 0.095953
Cache access 12151680 epsilon 0.000071 active size 1560 access ratio 43.378717 access speed(M/sec) 0.040197
Trainer id 0 Iteration 5 time 272.086716 obj 10380.244735 gap 0.000240
Cache access 11496425 epsilon 0.000064 active size 1553 access ratio 41.039607 access speed(M/sec) 0.042253
Total cache access 64450750
obj 10380.244735 Total time 1469.524430
real 25m9.155s
user 46m34.383s
sys 1m27.857s
One interesting thing about StreamSVM is that it uses two cores. It has a --num_learners command-line parameter, but according to the author it doesn’t work at the moment. As far as hyperparams go, the paper’s best value of C for this dataset is 1, which is also the default.
$ time ./StreamSVMPredict -f 16609143 --output_file data/streamsvm_p.txt --model_file data/streamsvm_model --train_file data/test_v.txt
Warning! Setting max_nnz to 16609143
Accuracy = 99.0597% (69213/69870)
Fscore = 99.2199%
Precision = 99.6447% (41782/41931)
Recall = 98.7988% (41782/42290)
real 1m0.091s
user 0m57.772s
sys 0m0.947s
The scores didn’t impress us:
accuracy: 0.990596822671
AUC: 0.991292619197
StreamSVM outputs both binary predictions and probabilities, like this:
-1 0.810701
-1 0.920251
-1 0.89225
Note that these are probabilities for the negative class, which may be misleading. It can be corrected in classifier.hpp; we sent the author a bug report. When you flip the estimates, AUC matches VW’s!
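Until the fix lands, you can flip the estimates yourself when scoring. A minimal sketch, again with a hypothetical test_labels.txt holding the true labels:

import numpy as np
from sklearn.metrics import roc_auc_score

preds = np.loadtxt('streamsvm_p.txt')   # columns: predicted label, P(negative class)
y_true = np.loadtxt('test_labels.txt')  # hypothetical file with true -1/1 labels

# the probability of the positive class is one minus P(negative class)
print('AUC:', roc_auc_score(y_true, 1.0 - preds[:, 1]))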
AUC: 0.998972438313
Conclusion
The current version of Vowpal Wabbit outperforms both Liblinear and StreamSVM on webspam. That’s just one dataset, so it would be good to compare on a broader selection. On the other hand, the results show that one had better take claims made in papers with a grain of salt, but you already knew that.
VW is also faster and allows using non-linear features like n-grams, which should score even better without any additional hassle. Its advantage is no surprise if you consider that VW comes from a strong industrial lab, while the other two projects are more academic in nature.
We would like to thank Rajarshi Das and Peter Ritz, the main supporters of the Indiegogo campaign, and Gurjeet Singh, who was the first to contribute. You made this article possible.