# Regression as classification

An interesting development occurred in the Job Salary Prediction competition at Kaggle: the guy who ranked third used logistic regression, even though the task is regression, not classification. We attempt to replicate the experiment.

The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples at the high end, we stop at 12.0, which gives us 36 bins. Here's the code:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    class_mapping[n] = i + 1
```
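As a sanity check, the mapping should indeed contain 36 classes, with 8.5 mapped to class 1 and 12.0 to class 36. A minimal self-contained version of the snippet above, with the checks printed out:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

# build the log-salary-to-class mapping, rounding away arange's float noise
a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = { round( n, 1 ): i + 1 for i, n in enumerate( a_range )}

print( len( class_mapping ))    # prints 36
print( class_mapping[8.5] )     # prints 1
print( class_mapping[12.0] )    # prints 36
```

Note that `round` is needed because `np.arange` accumulates floating-point error, so a raw value like `11.999999999999998` wouldn't match the literal key `12.0`.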

This way we get a mapping from log salaries to classes. Class labels start with 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be found in the `2vw_class.py` script. The actual conversion looks like this:

```python
try:
    label = str( class_mapping[label] )
except KeyError:
    # out-of-range salaries: clip to the top bin or the bottom one
    if label > max_salary:
        label = str( class_mapping[max_salary] )
    else:
        label = '1'
```
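For illustration, here is the same conversion wrapped in a hypothetical `to_class` helper (not part of `2vw_class.py`), assuming labels are log salaries already rounded to one decimal place. It covers both the in-range case and the two out-of-range cases:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

a_range = np.arange( min_salary, max_salary + interval, interval )
class_mapping = { round( n, 1 ): i + 1 for i, n in enumerate( a_range )}

def to_class( label ):
    # label: a log salary rounded to one decimal place
    try:
        return str( class_mapping[label] )
    except KeyError:
        # above the top bin -> top class, below the bottom bin -> class 1
        if label > max_salary:
            return str( class_mapping[max_salary] )
        else:
            return '1'

print( to_class( 10.7 ))    # prints 23
print( to_class( 12.2 ))    # prints 36
print( to_class( 8.3 ))     # prints 1
```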

After running VW and producing some predictions, we will need to convert classes back to numbers:

```python
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    reverse_class_mapping[i + 1] = n
```

We do this with the `reverse_map.py` script. We also provide the `mae_class.py` script for validation purposes. It computes mean absolute error from prediction and test files, both with class labels.
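We won't show `mae_class.py` here, but the idea can be sketched as follows. Since the reported scores (around 5000) are in salary units, we assume the error is computed on actual salaries, i.e. classes are mapped back to log salaries and then exponentiated; the file-reading details are omitted and the `mae` function name is ours:

```python
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

a_range = np.arange( min_salary, max_salary + interval, interval )
reverse_class_mapping = { i + 1: round( n, 1 ) for i, n in enumerate( a_range )}

def mae( predicted_classes, true_classes ):
    # classes -> log salaries -> salaries, then mean absolute error
    p = np.exp( [reverse_class_mapping[c] for c in predicted_classes] )
    t = np.exp( [reverse_class_mapping[c] for c in true_classes] )
    return np.mean( np.abs( p - t ))

# a perfect prediction has zero error
print( mae( [1, 10, 36], [1, 10, 36] ))    # prints 0.0
```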

## Running VW

Vowpal Wabbit supports a few multiclass classification modes, among them the common one-against-all mode (`--oaa`) and the error-correcting tournament mode (`--ect`). If you want to know what these mean, there's a paper. We found that `--ect` works better than `--oaa`.

For validation, one might use commands like these. You need to specify the number of classes when training; we have 36.

```
vw -d data/class/train_v.vw -k -c -f data/class/model --passes 10 -b 25 --ect 36
vw -t -d data/class/test_v.vw -k -c -i data/class/model -p data/class/p.txt
python mae_class.py data/class/test_v.vw data/class/p.txt
```

Validation MAE is around 5000: not quite the 3900 achieved by G. Song, but much better than our previous attempts using regression, and with practically no tweaking. The Kaggle private leaderboard score is 5244. In case you would like to rush off to apply this approach to your regression problem: we tried it in the Bulldozers competition and it didn't work, meaning that the results were somewhat worse than with regular regression.

The question remains: why? We think it might have something to do with a different data structure, and maybe with the scoring metric.