An interesting development occurred in the Job Salary Prediction competition at Kaggle: the guy who ranked third used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment.
The idea is to discretize salaries into a number of bins, just like with a histogram. Guocong Song, the man in question, used 30 bins. We like a convenient uniform bin width of 0.1, as the minimum log salary in the training set is 8.5 and the maximum is 12.2. Since there are few examples at the high end, we stop at 12.0, which gives us 36 bins. Here’s the code:
import numpy as np

min_salary = 8.5
max_salary = 12.0
interval = 0.1

# bin values: 8.5, 8.6, ..., 12.0
a_range = np.arange( min_salary, max_salary + interval, interval )

# map each (rounded) log salary to a class label, starting from 1
class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )    # get rid of floating point noise in the keys
    class_mapping[n] = i + 1
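A quick sanity check (this snippet is ours, not part of the original scripts): with these settings the mapping should contain 36 classes, with 8.5 going to class 1 and 12.0 to class 36.

print( len( class_mapping ) )     # 36
print( class_mapping[8.5] )       # 1
print( class_mapping[12.0] )      # 36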
This way we get a mapping from log salaries to classes. Class labels start at 1, because Vowpal Wabbit expects that, and we intend to use VW. The code can be found in the 2vw_class.py script. The actual conversion looks like this:
# label is the (rounded) log salary at this point
try:
    label = str( class_mapping[label] )
except KeyError:
    # salaries above the top bin go into the last class,
    # everything below the bottom bin goes into class 1
    if label > max_salary:
        label = str( class_mapping[max_salary] )
    else:
        label = '1'
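To show where this fits, here’s a tiny hypothetical example of the VW input we end up producing. The helper and the feature string below are made up for illustration and are not taken from 2vw_class.py; a multiclass VW example is just the class label, a pipe, and the features:

# hypothetical helper, for illustration only - not from 2vw_class.py
def to_vw_line( class_label, features ):
    return '%s | %s' % ( class_label, features )

# a log salary of 10.3 falls into class 19 with our mapping
print( to_vw_line( class_mapping[10.3], 'software engineer london' ))
# 19 | software engineer london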
After running VW and producing some predictions, we will need to convert the predicted classes back to log salaries:
# invert the mapping: class label -> log salary
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    n = round( n, 1 )
    reverse_class_mapping[i+1] = n
We do this with the reverse_map.py script. We also provide a mae_class.py script for validation purposes. It computes Mean Absolute Error from predictions and test files, both with class labels.
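For illustration, here’s a minimal sketch of how such a validation script could work. The parsing (class label as the first token of each line) and the exponentiation back to actual salaries are our assumptions, not necessarily how mae_class.py does it:

import sys
import math
import numpy as np

# rebuild the reverse mapping: class 1..36 -> log salary 8.5..12.0
a_range = np.arange( 8.5, 12.0 + 0.1, 0.1 )
reverse_class_mapping = {}
for i, n in enumerate( a_range ):
    reverse_class_mapping[i + 1] = round( n, 1 )

test_file, predictions_file = sys.argv[1], sys.argv[2]

errors = []
for test_line, p_line in zip( open( test_file ), open( predictions_file )):
    true_class = int( test_line.split()[0] )
    predicted_class = int( float( p_line.split()[0] ))

    # we assume the error is measured on actual salaries, hence exp()
    true_salary = math.exp( reverse_class_mapping[true_class] )
    predicted_salary = math.exp( reverse_class_mapping[predicted_class] )
    errors.append( abs( true_salary - predicted_salary ))

print( sum( errors ) / len( errors ))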
Running VW
Vowpal Wabbit supports a few multiclass classification modes, among them the common one-against-all mode and the error-correcting tournament mode. If you want to know what that means, there’s a paper. We found that --ect works better than --oaa.
For validation, one might use commands like these. You need to specify the number of classes when training; we have 36.
vw -d data/class/train_v.vw -k -c -f data/class/model --passes 10 -b 25 --ect 36
vw -t -d data/class/test_v.vw -k -c -i data/class/model -p data/class/p.txt
python mae_class.py data/class/test_v.vw data/class/p.txt
Validation MAE is around 5000, not quite the 3900 achieved by G. Song, but much better than our previous attempts using regression, and with practically no tweaking. The Kaggle private leaderboard score is 5244.
In case you would like to rush off and apply this approach to your regression problem: we tried it with the Bulldozers competition and it didn’t work, meaning that the results were somewhat worse than with regular regression.
The question remains: why? We think it might have something to do with the structure of the data, and maybe with the scoring metric.