FastML

Machine learning made easy

Regression as classification

An interesting development occurred in the Job salary prediction competition at Kaggle: the contestant who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment.

The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man himself, used 30 bins. We prefer a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples at the high end, we stop at 12.0, which gives us 36 bins. Here's the code:

import numpy as np

min_salary = 8.5    # minimum log salary in the training set
max_salary = 12.0   # we cap here; few examples above
interval = 0.1      # uniform bin width

# 36 bin centers: 8.5, 8.6, ..., 12.0
a_range = np.arange( min_salary, max_salary + interval, interval )

# map each rounded log salary to a class label, starting at 1
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )    # counter floating-point drift from arange
    class_mapping[n] = i + 1
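As a quick sanity check, here's a self-contained rerun of the snippet, confirming that the mapping covers exactly 36 classes, with 8.5 in the bottom bin and 12.0 in the top one:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    class_mapping[n] = i + 1

print( len( class_mapping ) )   # 36 classes
print( class_mapping[8.5] )     # 1, the bottom bin
print( class_mapping[12.0] )    # 36, the top bin
```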

This way we get a mapping from log salaries to classes. Class labels start at 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be found in the 2vw_class.py script. The actual conversion looks like this:

try:
    # label is a log salary, rounded to one decimal place
    label = str( class_mapping[label] )
except KeyError:
    # the label falls outside the mapping
    if label > max_salary:
        # clip high salaries to the top class
        label = str( class_mapping[max_salary] )
    else:
        # anything below the minimum gets the bottom class
        label = '1'
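A sketch of how the fallback behaves, wrapping the snippet above in a hypothetical to_class() helper and feeding it the 12.2 maximum from the training set, along with an in-range value:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    class_mapping[n] = i + 1

def to_class( label ):
    # label is a log salary, rounded to one decimal place
    try:
        return str( class_mapping[label] )
    except KeyError:
        if label > max_salary:
            # clip high salaries to the top class
            return str( class_mapping[max_salary] )
        else:
            # anything below the minimum gets the bottom class
            return '1'

print( to_class( 12.2 ) )   # '36' - clipped to the top bin
print( to_class( 10.0 ) )   # '16' - a regular in-range value
print( to_class( 8.0 ) )    # '1'  - below the minimum
```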

After running VW and producing some predictions, we will need to convert classes back to numbers:

# invert the mapping: class label -> log salary (the bin center)
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    reverse_class_mapping[i+1] = n

We do this with the reverse_map.py script. We also provide the mae_class.py script for validation purposes. It computes Mean Absolute Error from prediction and test files, both with class labels.
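Here's a minimal sketch of the idea behind such a validation script, operating on lists of class labels rather than files. The exponentiation step is our assumption: since validation MAE comes out in salary units, the classes presumably get mapped back to log salaries and then exponentiated before computing the error.

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

# class label -> log salary (the bin center)
a_range = np.arange( min_salary, max_salary + interval, interval )
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    reverse_class_mapping[i + 1] = round( n, 1 )

def mae( true_classes, predicted_classes ):
    # convert class labels back to log salaries, then exponentiate
    # to get actual salaries, since the contest metric is MAE on salaries
    errors = []
    for t, p in zip( true_classes, predicted_classes ):
        true_salary = np.exp( reverse_class_mapping[t] )
        predicted_salary = np.exp( reverse_class_mapping[p] )
        errors.append( abs( true_salary - predicted_salary ) )
    return np.mean( errors )

# a perfect prediction gives zero error; being one bin off at
# log salary 10.0 costs roughly exp(10.1) - exp(10.0), about 2300
print( mae( [16], [16] ) )
print( mae( [16], [17] ) )
```

Note how the exponential scale makes a one-bin mistake much more expensive in the high bins than in the low ones.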

Running VW

Vowpal Wabbit supports a few multiclass classification modes, among them the common one-against-all mode and the error-correcting tournament mode. If you want to know what they mean, there's a paper. We found that --ect works better than --oaa.

For validation, one might use commands like these. You need to specify the number of classes in training; we have 36.

vw -d data/class/train_v.vw -k -c -f data/class/model --passes 10 -b 25 --ect 36

vw -t -d data/class/test_v.vw -k -c -i data/class/model -p data/class/p.txt

python mae_class.py data/class/test_v.vw data/class/p.txt

Validation MAE is around 5000: not quite the 3900 achieved by G. Song, but much better than our previous attempts using regression, and with practically no tweaking. The Kaggle private leaderboard score is 5244.

post-contest submission result

In case you would like to rush off and apply this approach to your regression problem: we tried it in the Bulldozers competition and it didn't work, meaning the results were somewhat worse than with regular regression.

The question remains: why? We think it might have something to do with the structure of the data, and maybe with the scoring metric.
