Our Indiegogo campaign turned out to be a (partial) success, so we deliver as promised: a comparison of Vowpal Wabbit, Liblinear/SBM and StreamSVM on the webspam dataset. Refer to Comparing large-scale linear learners for motivation and references.
Just to recap: the Liblinear/SBM and StreamSVM papers deal with linear learners able to handle data which doesn’t fit into memory. Among other things, the authors experiment with various tools and find VW not quite competitive. We set out to clarify the situation using current versions of the software involved.
The authors of the Liblinear/SBM paper run Vowpal Wabbit on two datasets, which narrows down the selection as far as reproducing the results goes. These two sets are webspam and kdd10. We have chosen the former because it’s bigger: 24GB vs 5GB. The downside of webspam is that it’s easy to get 99% accuracy, while kdd10 is more demanding, with accuracy in the 85-90% range.
Webspam has 350k rows and 16.6M features. It is available from the LibSVM dataset repository and was originally used in the Pascal Large Scale Learning Challenge. The underlying data consists of web pages’ HTML source, classified as spam or not. We’re dealing with the trigram version; there’s also a unigram version with a much smaller number of features. We split the data randomly into training and test sets, roughly 280k / 70k examples, as was done in the papers.
python split.py train.txt train_v.txt test_v.txt 0.8 dat_seed
split.py is a phraug script; it randomly splits the given file into two, with 80% of the lines going to the first file and the rest to the second. We use a fixed random seed so that you can get exactly the same files.
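Conceptually, the split amounts to something like this (a minimal sketch, not the actual phraug code; the function name is ours):

```python
import random
import sys

def split_file(input_path, out1_path, out2_path, p, seed):
    """Randomly route each line of input_path to out1 (with probability p) or out2."""
    random.seed(seed)  # a fixed seed makes the split reproducible
    with open(input_path) as fin, open(out1_path, "w") as f1, open(out2_path, "w") as f2:
        for line in fin:
            (f1 if random.random() < p else f2).write(line)

if __name__ == "__main__" and len(sys.argv) == 6:
    # usage mirrors the call above: split.py input out1 out2 p seed
    split_file(sys.argv[1], sys.argv[2], sys.argv[3], float(sys.argv[4]), sys.argv[5])
```

With p = 0.8 and the same seed, every run produces identical output files, which is what makes the experiment reproducible.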
Since the data is in libsvm format, we need to convert it for VW. In this case the VW format differs from libsvm only by something like “|x” between the label and the rest of the line. To speed things up, let’s skip the intermediate text file and save the data as VW’s binary cache file. Instead of saving the converter’s output to a file, we send it to standard output (Unix territory) and pipe it directly to VW, which will train and create a cache file as a side effect - the side effect we seek.
python libsvm2vw.py train_v.txt /dev/stdout | vw --cache_file train_v.cache
python libsvm2vw.py test_v.txt /dev/stdout | vw --cache_file test_v.cache
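We don’t reproduce libsvm2vw.py here, but the conversion is simple enough to sketch (a hypothetical minimal version; the function name is ours):

```python
import sys

def libsvm_to_vw(line):
    """Convert one libsvm-format line to VW format.

    libsvm:  -1 5:1 103:0.25 ...
    VW:      -1 |x 5:1 103:0.25 ...
    """
    label, _, features = line.strip().partition(" ")
    return "%s |x %s\n" % (label, features)

if __name__ == "__main__" and len(sys.argv) == 3:
    # usage mirrors the calls above: libsvm2vw.py input output (/dev/stdout works too)
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        for line in fin:
            fout.write(libsvm_to_vw(line))
```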
UPDATE: the contents of the cache depend on the number of bits used for the hash table (-b). One can use an existing cache for training with fewer bits, but not vice versa. Therefore it makes sense to create the cache with a higher bit count:
python libsvm2vw.py train_v.txt /dev/stdout | vw -b 29 --cache_file train_v.cache
python libsvm2vw.py test_v.txt /dev/stdout | vw -b 29 --cache_file test_v.cache
And now we find the optimal number of passes using VW’s built-in validation/early-stopping mode:
vw --cache_file train_v.cache --loss_function logistic -b 29 -P 10000 --passes 10
(...)
finished run
number of examples per pass = 252163
passes used = 5
weighted example sum = 1.26082e+06
weighted label sum = 267743
average loss = 0.0341765 h
best constant = 0.212357
total feature number = 4703937821
Then we disable it (--holdout_off) to train on the full file:
$ time vw --cache_file train_v.cache --loss_function logistic -b 29 -P 10000 --passes 5 --holdout_off -f model_v_logistic
(...)
finished run
number of examples per pass = 280130
passes used = 5
weighted example sum = 1.40065e+06
weighted label sum = 298340
average loss = 0.0178536
best constant = 0.213001
total feature number = 5225004535

real    4m10.574s
user    2m55.532s
sys     0m14.009s
See Holdout test for validation and early stopping for an explanation of this procedure.
Once we have predictions, we can score them using the code available on GitHub:
accuracy: 0.994675826535
AUC: 0.998418419401
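For reference, here’s roughly what the scoring computes - a self-contained sketch of accuracy and AUC, not the actual script. The rank-sum AUC below is O(n²), fine for a sanity check but slow for 70k predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))

def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
    randomly chosen positive example scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```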
We’ll be comparing the classifiers’ accuracy, so another option is to use hinge loss instead of logistic. The procedure is the same; the scores are slightly worse.
accuracy: 0.993502218406
AUC: 0.99632599445
For starters, let’s just try a tool that’s not designed to handle data bigger than memory. When you run regular Liblinear’s train on webspam, nothing happens for a long time. If you look at system monitoring tools, iotop shows intensive disk reads (60-90 MB/s, probably the max for this system), while the CPU is at 20-40% of one core. After a while the CPU goes to roughly full utilization of one core and memory usage starts to climb - probably a second pass over the data, this time loading it. Then CPU utilization becomes very irregular, swap usage starts to climb and the system becomes intermittently unresponsive. All this time train didn’t output a word. It’s time to kill the process and try Liblinear/SBM.
We downloaded and built the software according to the instructions in the README, no rocket science here. The software is called liblinear-cdblock and is a fork of Liblinear, so if you need other non-standard extensions you’re out of luck. Instead of the train program in the original you get blocktrain, and additionally blocksplit.
The first thing to do is to blocksplit the training file into a number of blocks. Each block should fit into memory. The default number of blocks is eight; with 12GB of memory and webspam you can get away with just two.
The process takes one core to roughly 100%, uses practically no memory and creates a folder with blocks in the current directory. The folder is named after the input file, with a dot and the number of blocks appended, for example train_v.txt.2:
$ time ./blocksplit -m 2 data/train_v.txt
............................
time : 822

real    13m42.357s
user    12m8.597s
sys     0m10.435s
The software outputs a series of dots so at least you know something is going on.
Now it’s time to run blocktrain. We use the default settings. The CPU gets maxed out and memory usage grows. After a while the software spits out a line with the first iteration’s stats, including cumulative run time, the value of the objective function and of the gap (whatever that is). The default maximum number of iterations is five, and the first takes roughly two minutes on our hardware. The following iterations are progressively quicker. The objective stays the same after the second iteration; only the gap gets smaller.
In the middle of each iteration, CPU utilization falls to zero for a couple of seconds. This is synchronized with the memory usage pattern.
$ time ./blocktrain data/train_v.txt.2/ data/cdblock_model
iter 1 time 118 runtime 118 loadtime 0 obj -9213.28 PGmax 0.015024 PGmin -0.01629 gap 0.031314
iter 2 time 228 runtime 228 loadtime 0 obj -9234.21 PGmax 0.013591 PGmin -0.011198 gap 0.02479
iter 3 time 315 runtime 315 loadtime 0 obj -9234.22 PGmax 0.009242 PGmin -0.027316 gap 0.036558
iter 4 time 402 runtime 402 loadtime 0 obj -9234.22 PGmax 0.0038895 PGmin -0.010379 gap 0.014268
iter 5 time 488 runtime 488 loadtime 0 obj -9234.23 PGmax 0.0017845 PGmin -0.0024688 gap 0.0042533

real    8m14.249s
user    5m17.476s
sys     0m18.643s
predict seems to support probability estimates, but only with logistic regression; otherwise you get a bunch of ones and negative ones.
$ time ./predict data/test_v.txt data/cdblock_model data/cdblock_p.txt
Accuracy = 99.3245% (69398/69870)

real    1m5.930s
user    1m1.962s
sys     0m1.584s
Generally, everything works as advertised with regard to handling large files. The metrics are slightly, but strictly, worse than VW’s (either logistic or hinge). We fall just a bit short of the accuracy reported in the paper, 0.995, probably because of using the default hyperparams.
accuracy: 0.993244597109
AUC: 0.993511427279
Now let’s try the logistic regression solver so that we can get probability estimates. It shows fewer stats and iterations take longer - the first 140 seconds, the last about 110.
$ time ./blocktrain -s 7 data/train_v.txt.2/ /path/to/model/file
iter 1 time 141 runtime 141 loadtime 0 obj -19147.418065
iter 2 time 282 runtime 282 loadtime 0 obj -19588.075924
iter 3 time 393 runtime 393 loadtime 0 obj -19605.536968
iter 4 time 507 runtime 507 loadtime 0 obj -19612.382242
iter 5 time 621 runtime 621 loadtime 0 obj -19613.193329

real    10m22.001s
user    7m11.244s
sys     0m18.864s
Be careful to specify a valid path for the model file. Otherwise the program will run and won’t complain, but there will be no model.
$ time ./blocktrain -s 7 data/train_v.txt.2/ data/cdblock_model_s7
iter 1 time 140 runtime 140 loadtime 0 obj -19147.418065
iter 2 time 275 runtime 275 loadtime 0 obj -19588.075924
iter 3 time 383 runtime 383 loadtime 0 obj -19605.536968
iter 4 time 495 runtime 495 loadtime 0 obj -19612.382242
iter 5 time 609 runtime 609 loadtime 0 obj -19613.193329

real    10m15.373s
user    7m14.188s
sys     0m21.043s
Time to predict.
$ time ./predict data/test_v.txt data/cdblock_model_s7 data/cdblock_p_s7.txt
Accuracy = 98.5201% (68836/69870)

real    1m7.231s
user    1m2.190s
sys     0m1.569s
The results are significantly worse:
accuracy: 0.985201087734
AUC: 0.985763288671
Let’s tell Liblinear we want probabilities, using -b 1:

$ time ./predict -b 1 data/test_v.txt data/cdblock_model_s7 data/cdblock_p_s7_probs.txt
probability output is only supported for logistic regression
Apparently no probability estimates from Liblinear today.
Generally, for us Liblinear is no match for Vowpal Wabbit, at least on this dataset. It runs longer, offers worse feedback and slightly worse scores. Maybe tuning the hyperparams could improve the results: the authors report setting the most important parameter, C, to 64, while the default is 1. However, they also report 99.5% accuracy while we get 99.3%, so it seems there’s not that much to be gained here.
Building StreamSVM is a bit more involved than building Liblinear: one needs a few dependencies. Luckily, apt-get install ... does the trick.
With this tool, you don’t need to convert the data, which is a plus. However, this seems to have a price: it takes much longer to train than either VW or Liblinear.
A small inconvenience greets us when calling the program: you need to specify the number of training examples and the data dimensionality up front. Let’s count those lines, then.
wc -l train_v.txt
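wc -l gives the number of examples (-d); for the dimensionality (-f) you can scan the file once for the highest feature index. A throwaway sketch, assuming libsvm format:

```python
def libsvm_stats(path):
    """Scan a libsvm-format file once, returning (n_rows, max_feature_index).

    n_rows goes to StreamSVM's -d flag, max_feature_index to -f;
    neither is stored in the file itself.
    """
    n_rows, max_index = 0, 0
    with open(path) as f:
        for line in f:
            n_rows += 1
            for tok in line.split()[1:]:  # skip the label
                idx = int(tok.split(":", 1)[0])
                if idx > max_index:
                    max_index = idx
    return n_rows, max_index
```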
When you run StreamSVM, there’s some info at the start and then it goes silent. Then comes the iteration info. The first iteration took 270 seconds - more than twice as long as with Liblinear. The value of the objective function stays the same after the third iteration.
$ time ./StreamSVM -d 280130 -f 16609143 data/train_v.txt data/streamsvm_model
Warning! Setting max_nnz to 16609143
model parameter C is not specified. C=1 would be applied this time.
model_name is specified but the way to save is not specified. the model will be saved at the end
model_name is data/streamsvm_model
Trainer id 0 Iteration 1 time 279.346566 obj 9848.694509 gap 6.150838 Cache access 13237417 epsilon 1.274152 active size 9675 access ratio 47.254550 access speed(M/sec) 0.047387
Trainer id 0 Iteration 2 time 297.081339 obj 10376.918432 gap 4.585239 Cache access 13909444 epsilon 0.100929 active size 8720 access ratio 49.653532 access speed(M/sec) 0.046820
Trainer id 0 Iteration 3 time 318.703953 obj 10380.234806 gap 0.400062 Cache access 13655784 epsilon 0.000269 active size 1606 access ratio 48.748024 access speed(M/sec) 0.042848
Trainer id 0 Iteration 4 time 302.305856 obj 10380.244612 gap 0.095953 Cache access 12151680 epsilon 0.000071 active size 1560 access ratio 43.378717 access speed(M/sec) 0.040197
Trainer id 0 Iteration 5 time 272.086716 obj 10380.244735 gap 0.000240 Cache access 11496425 epsilon 0.000064 active size 1553 access ratio 41.039607 access speed(M/sec) 0.042253
Total cache access 64450750 obj 10380.244735
Total time 1469.524430

real    25m9.155s
user    46m34.383s
sys     1m27.857s
One interesting thing about StreamSVM is that it uses two cores. It has a --num_learners command line parameter, but according to the author it doesn’t work at the moment. As far as hyperparams go, the paper reports that the best value of C for this dataset is 1, which is also the default.
$ time ./StreamSVMPredict -f 16609143 --output_file data/streamsvm_p.txt --model_file data/streamsvm_model --train_file data/test_v.txt
Warning! Setting max_nnz to 16609143
Accuracy = 99.0597% (69213/69870)
Fscore = 99.2199%
Precision = 99.6447% (41782/41931)
Recall = 98.7988% (41782/42290)

real    1m0.091s
user    0m57.772s
sys     0m0.947s
The scores didn’t impress us:
accuracy: 0.990596822671
AUC: 0.991292619197
StreamSVM outputs both binary predictions and probabilities, like this:
-1 0.810701
-1 0.920251
-1 0.89225
Note that these are probabilities for the negative class, which may be misleading. It can be corrected in classifier.hpp; we sent the author a bug report. When you take this into account, AUC matches VW’s!
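Until that fix lands, you can flip the estimates yourself before computing AUC - a quick sketch (file names are just examples):

```python
def flip_streamsvm_probs(in_path, out_path):
    """StreamSVM writes P(negative class); flip each estimate to P(positive)."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            label, prob = line.split()
            fout.write("%s %.6f\n" % (label, 1.0 - float(prob)))
```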
The current version of Vowpal Wabbit outperforms both Liblinear and StreamSVM on webspam. That’s just one dataset, so it would be good to compare using a broader selection. On the other hand, the results show that one had better take the claims made in papers with a grain of salt, but you already knew that.
VW is also faster and allows using non-linear features like n-grams, which should score even better without any additional hassle. Its advantage is no surprise if you consider that VW comes from a strong industrial lab, while the other two projects are of a more academic quality.
We would like to thank Rajarshi Das and Peter Ritz, the main supporters of the Indiegogo campaign, and Gurjeet Singh, who was the first to contribute. You made this article possible.