Machine learning made easy

Regression as classification

An interesting development occurred in the Job Salary Prediction competition at Kaggle: the guy who ranked third used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment.

The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man in question, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples at the high end, we stop at 12.0, which gives us 36 bins. Here’s the code:

import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

# bin edges: 8.5, 8.6, ..., 12.0 - 36 values in total
a_range = np.arange( min_salary, max_salary + interval, interval )

# map each rounded log salary to a class label, starting from 1
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    class_mapping[n] = i + 1
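
For instance, the resulting mapping has 36 classes, with 8.5 going to class 1 and 12.0 to class 36:

print( len( class_mapping ) )    # 36
print( class_mapping[8.5] )      # 1
print( class_mapping[12.0] )     # 36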

This way we get a mapping from log salaries to classes. Class labels start with 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be found in the conversion script. The actual conversion looks like this:

try:
    label = str( class_mapping[label] )
except KeyError:
    # the log salary falls outside the binned range
    if label > max_salary:
        label = str( class_mapping[max_salary] )
    else:
        label = '1'

After running VW and producing some predictions, we will need to convert classes back to numbers:

# invert the mapping: class label -> log salary
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    reverse_class_mapping[i+1] = n
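
For example, if VW predicts class 23 for some test example, the mapping takes it back to a log salary of 10.7, or roughly 44,356 in actual salary:

predicted_class = 23
log_salary = reverse_class_mapping[predicted_class]    # 10.7
salary = np.exp( log_salary )                          # about 44356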

We do this with a conversion script. We also provide a script for validation purposes: it computes mean absolute error from the predictions and test files, both with class labels.
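
Here is a minimal sketch of what such a validation script might look like; it is not the original code. We call it mae_from_class.py purely as a placeholder, and we assume that the first token of each VW line is the class label, that the predictions file holds one predicted class per line, and that the error is measured on actual (exponentiated) salaries:

# mae_from_class.py - a placeholder name; a minimal sketch, not the original script
# usage: python mae_from_class.py data/class/test_v.vw data/class/p.txt

import sys
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

# class label -> log salary, same mapping as above
a_range = np.arange( min_salary, max_salary + interval, interval )
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    reverse_class_mapping[i+1] = round( n, 1 )

test_file = sys.argv[1]
predictions_file = sys.argv[2]

errors = []
with open( test_file ) as tf, open( predictions_file ) as pf:
    for test_line, p_line in zip( tf, pf ):
        # first token of a VW line is the class label;
        # predictions may come as floats like "23.000000"
        true_class = int( test_line.split()[0] )
        predicted_class = int( float( p_line.split()[0] ))

        # convert classes back to log salaries, then to salaries
        true_salary = np.exp( reverse_class_mapping[true_class] )
        predicted_salary = np.exp( reverse_class_mapping[predicted_class] )

        errors.append( abs( true_salary - predicted_salary ))

print( 'MAE: %.2f' % np.mean( errors ))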

Running VW

Vowpal Wabbit supports a few multiclass classification modes, among them the common one-against-all mode (--oaa) and the error-correcting tournament mode (--ect). If you want to know what they mean, there’s a paper. We found that --ect works better than --oaa here.

For validation, one might use commands like these. You need to specify the number of classes when training; we have 36.

vw -d data/class/train_v.vw -k -c -f data/class/model --passes 10 -b 25 --ect 36

vw -t -d data/class/test_v.vw -k -c -i data/class/model -p data/class/p.txt

python mae_from_class.py data/class/test_v.vw data/class/p.txt

Validation MAE is around 5000, not quite the 3900 achieved by Guocong Song, but much better than our previous attempts using regression, and with practically no tweaking. The Kaggle private leaderboard score is 5244.

[Image: post-contest submission result]

In case you would like to rush off and apply this approach to your own regression problem: we tried it on the Bulldozers competition and it didn’t work, meaning that the results were somewhat worse than with regular regression.

The question remains: why? We think it might have something to do with the different structure of the data and maybe with the scoring metric.