Recently we’ve been browsing papers about out-of-core linear learning on a single machine. While for us this task is basically synonymous with Vowpal Wabbit, it turns out that there are other options.
Out-of-core means being able to work with data that doesn’t fit into memory. Here’s a quote from a paper by Matsushima and Smola [3] (the references are at the bottom) that makes for a good introduction:
As massive, high-dimensional data sets are becoming commonplace, there is a recent surge of interest in linear SVMs. Some recent papers tackle this problem with great success, and provide algorithms with convergence guarantees. These solvers either assume that the data resides in memory or that a cluster of machines with appropriate communication and synchronization facilities is available.
In an award-winning paper, [1] revisited the problem of training linear SVMs when the data does not fit into memory. In a nutshell, the key idea is to split the data into manageable blocks, compress and store each block on disk, and perform dual coordinate descent by loading each block sequentially. This basic idea was improved upon by [2], who observed that the block minimization (BM) algorithm of [1] does not retain important points before discarding each block. They therefore propose to retain some important points from the previous blocks in RAM. This simple idea leads to significant speed-ups, and [2] demonstrate that their selective block minimization (SBM) algorithm outperforms BM. Here we propose a new algorithm, StreamSVM, and show that it outperforms both BM and SBM on a number of publicly available datasets. [3]
The story goes like this: the Liblinear people came up with block minimization [1] and a later improvement, selective block minimization [2]. Matsushima and Smola upped the ante and introduced the dual cached loops method [3], which they claim is even better.
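To make the block idea more concrete, here’s a minimal sketch of block minimization for an L1-loss linear SVM: the blocks live on disk, dual coordinate descent runs on one block at a time, and a single weight vector (plus the dual variables) persists across blocks. The .npz file format, the dense arrays and the retention rule mentioned in the closing comment are our own illustrative assumptions, not the authors’ code.

```python
# A rough sketch of block minimization in the spirit of [1]: the data is split into
# blocks stored on disk, and dual coordinate descent is run on one block at a time
# while the weight vector and dual variables persist across blocks.
import numpy as np

def dual_cd_block(X, y, alpha, w, C=1.0, inner_epochs=5):
    """A few epochs of dual coordinate descent on one block.

    X: (n_block, d) dense features, y: labels in {-1, +1},
    alpha: dual variables for this block, w: shared weight vector (updated in place).
    """
    sq_norms = (X * X).sum(axis=1)           # Q_ii = x_i . x_i
    for _ in range(inner_epochs):
        for i in np.random.permutation(len(y)):
            if sq_norms[i] == 0:
                continue
            G = y[i] * X[i].dot(w) - 1.0     # gradient of the dual objective w.r.t. alpha_i
            # single-variable Newton step, clipped to the box [0, C]
            new_alpha = min(max(alpha[i] - G / sq_norms[i], 0.0), C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return alpha, w

def block_minimization(block_files, d, C=1.0, outer_passes=10):
    """Cycle over blocks on disk; each .npz holds arrays 'X' and 'y' (hypothetical format)."""
    w = np.zeros(d)
    alphas = {f: None for f in block_files}   # dual variables, kept per block
    for _ in range(outer_passes):
        for f in block_files:
            blk = np.load(f)                  # only one block is held in memory at a time
            X, y = blk["X"], blk["y"]
            if alphas[f] is None:
                alphas[f] = np.zeros(len(y))
            alphas[f], w = dual_cd_block(X, y, alphas[f], w, C)
    return w

# Selective block minimization [2] adds one twist: before a block is evicted, examples
# that still look informative (e.g. those with 0 < alpha_i < C) are copied into a small
# in-memory cache and optimized together with every subsequent block.
```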
The plot suggests that StreamSVM is comparable to Liblinear when the latter is able to keep the data in memory, and much quicker than SBM. Note that what’s plotted on the y axis is the value of the objective function (training loss), while we’re more interested in test accuracy or AUC.
Where does Vowpal Wabbit fit in this picture?
SBM vs Vowpal Wabbit
Results reported in [2] on webspam and kddcup10 show that SBM and VW are very similar in terms of speed (VW slightly faster), but SBM enjoys significantly higher accuracy:
- 99.5 vs 98.4 on webspam
- 89.7 vs 85.2 on kddcup10
Note that they use VW 5.1 - it’s not clear how the current version, 7.x, compares. Other online algorithms tested (perceptron, confidence-weighted, MIRA) offer more or less similar accuracy, but spend an order of magnitude more time on I/O.
Comparison between SBM and online learners: (…) we mentioned that people in the community believe that online learners are fast for large data sets since they make use of a simple update rule. However, as we show, online learners use more I/O time when the data is not in memory.
StreamSVM vs Vowpal Wabbit
And now for something completely different: the authors of [3] don’t bother including VW in their experiments at all:
We performed experiments on some of the largest publicly available data sets for binary classification to evaluate the performance of our algorithm, and to compare it with two existing methods, namely SBM and BM. Since SBM was shown to be competitive with other methods such as Vowpal Wabbit (VW), Stochastic Gradient Descent (SGD), or Pegasos we do not explicitly compare against them here.
An independent comparison is in order
Is it time to rethink the tool of choice for large-scale learning? To find out, we have set up a crowdfunding campaign on Indiegogo (a site like Kickstarter, but international). The aim of the campaign is to provide funding for an independent comparison of VW, Liblinear, Liblinear with the SBM extension, and StreamSVM. You’ll find the link below.
We plan to compare the learners on the larger version of the webspam dataset from the LIBSVM repository. It’s the set used in the papers; it has 350k rows, over 16M columns and weighs in at 24 GB.
We’ll split it 80/20 for training and testing, as was done previously. This will result in 280k rows for training and 70k rows for testing. We’ll build the tools and run them on the same training set (a disclosure: we have VW built already, as might be evident if you have read some of the previous articles on this site). Then we’ll produce predictions and check accuracy and AUC.
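For the record, here’s roughly how we imagine the scoring step - a minimal sketch assuming each tool writes its raw scores to a text file, one per line, in the same order as the test labels. The file names are made up for illustration.

```python
# Score a tool's predictions against the held-out 20% test split:
# accuracy on thresholded scores, AUC on the raw scores.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.loadtxt("webspam_test_labels.txt")    # labels in {-1, +1}, hypothetical file
scores = np.loadtxt("vw_predictions.txt")         # raw margins / decision values, hypothetical file

y_pred = np.where(scores > 0, 1.0, -1.0)          # threshold raw scores at zero
print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, scores)) # AUC uses the raw scores, not the labels
```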
In the process we’ll get a good feel for each tool, which will allow us to draw conclusions.
The campaign
Check it out (and contribute) on Indiegogo: Large scale linear learners compared
References
[1] Yu, Hsieh, Chang and Lin - Large Linear Classification When Data Cannot Fit in Memory (PDF)
[2] Chang and Roth - Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models (PDF)
[3] Matsushima, Vishwanathan and Smola - Linear Support Vector Machines via Dual Cached Loops (PDF)
Earlier posts about Vowpal Wabbit.