There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not?
As Sashi noticed on the forums, it’s not difficult to improve on the benchmarks a little bit. In particular, he mentioned feature selection, normalizing the data and using a regularized linear model. Here’s our version of the story.
Let’s start with normalizing. There’s a nice function for that in R: scale(). The dataset is small, just 1128 examples, so we can go ahead and use R.
It turns out that in its raw form the data won’t scale. That’s because some columns contain only zeros, so their standard deviation is zero and scale() ends up dividing by it.
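For example, scaling a column of zeros produces nothing but NaNs:
scale( c( 0, 0, 0 ) )   # centering gives zeros, then dividing by sd = 0 yields NaN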
Fortunately, we know just the right tool for the task. We learned about it on the Kaggle forums too. It’s a function in the caret package called nearZeroVar(). It gives you the indexes of all the columns with near-zero variance, so you can delete them:
library( caret )
nzv_i = nearZeroVar( data )   # indexes of near-zero-variance columns
data = data[,-nzv_i]          # drop them
Easy as that. We can scale now. But wait - weren’t there non-numerical features in the data (Arabic/English)? Weren’t there writer IDs?
No big deal here. We’ll just represent Arabic/English as 0/1 and get rid of writer IDs before scaling.
In fact, we’ll do this with a Python script even before loading the data into R. It can certainly be done in R, but we just fancied Python.
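For reference, here’s roughly what that preprocessing plus the scaling step looks like in R. A minimal sketch, assuming the language and writer columns are called language and writer, and that the gender label sits in the first column (the real CSV headers may differ):
data$language = ifelse( data$language == "Arabic", 0, 1 )   # encode Arabic/English as 0/1
data$writer = NULL                                          # drop writer IDs
data[,-1] = scale( data[,-1] )                              # scale everything except the label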
Feature selection
The dataset has a thousand examples and seven thousand features. That’s not a good ratio, because it offers tremendous overfitting potential. One good thing is that we just pruned about 2500 near-zero-variance columns.
Out of the roughly 4500 features left, we select 100. That’s an arbitrary number, but it makes working with R much faster. We use mRMR, as described in Feature selection in practice.
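One way to run mRMR without leaving R is the mRMRe package. This is just a sketch of that route, not necessarily the exact tool used in Feature selection in practice, and it assumes the label V1 is the first column and everything else is numeric:
library( mRMRe )

d = data
d$V1 = as.numeric( d$V1 )                 # mRMRe wants numeric (or ordered factor) columns
dd = mRMR.data( data = d )

fs = mRMR.classic( data = dd, target_indices = 1, feature_count = 100 )
selected = solutions( fs )[[1]]           # indexes of the chosen feature columns

data = data[, c( 1, as.vector( selected ))]   # keep the label plus the selected features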
Training a classifier
A regularized linear model works better than a plain GLM. We used the ridge package. Its advantage is the ability to determine lambda, the amount of regularization, automatically.
library( ridge )
model = linearRidge( V1 ~ ., train )   # V1 is y, or gender
p = predict( model, test )
# rescale p to the [0, 1] range here, or just clip values outside it,
# e.g. p = pmin( pmax( p, 0 ), 1 )
Or maybe logisticRidge, which is a bit slower:
model = logisticRidge( V1 ~ ., train )
p = predict( model, test )      # linear predictor
p = 1 / ( 1 + exp( -p ))        # apply the sigmoid to get probabilities
A random forest works similarly well. Either classifier will get you below a 0.6 score*.
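For reference, here is what the random forest version might look like, using the randomForest package (one common choice; we’re not claiming it’s the exact implementation used):
library( randomForest )

train$V1 = as.factor( train$V1 )                # classification, not regression
model = randomForest( V1 ~ ., train )
p = predict( model, test, type = "prob" )[,2]   # probability of the second class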
Validation
There’s a sneaky issue with this data set: if you want realistic metrics from validation, you need to validate a certain way. Namely, you need to split the set so that writers who appear in the train part do not appear in the test part.
Otherwise a powerful classifier like a random forest will learn to recognize a particular writer’s style. If it learns the style from the train set, it can apply that knowledge in prediction. This results in an overly optimistic score, because the real test set consists only of unseen writers.
We provide a Python script for randomly splitting the data that takes the writer issue into account. You run it like this:
split_by_writers.py original_train.csv train.csv train_val.csv test_val.csv 0.9
The first argument is the original training file with a writers column. The second is the file you want to split. The reason these are separate is that you may want to split a file with the writer info already stripped.
The third and fourth arguments are the output files, and the fifth is the split ratio between them - a probability, 0.9 by default.
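If you’d rather stay in R, the same idea fits in a few lines. A sketch, assuming the writer IDs live in a column called writer:
writers = unique( data$writer )
in_train = runif( length( writers )) < 0.9             # each writer lands in train with probability 0.9
train_writers = writers[ in_train ]

train_val = data[ data$writer %in% train_writers, ]
test_val = data[ !( data$writer %in% train_writers ), ]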
If instead of splitting you’d like to divide a file into a few same-sized chunks, use this:
chunk_by_writers.py original_train.csv train.csv 10
10 is the number of chunks. You don’t specify output files; the script names them automatically, like this:
train_0.csv
train_1.csv
…
train_9.csv
These files can then be used for cross-validation.
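The R equivalent is to assign folds at the writer level rather than the row level. Again just a sketch, assuming a writer column:
k = 10
writers = unique( data$writer )
writer_fold = sample( rep( 1:k, length.out = length( writers )))   # a random fold for each writer
fold = writer_fold[ match( data$writer, writers )]                 # map writer folds back to rows

# rows with fold == i form the i-th chunk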
There you have it. Now go beat the benchmark.
*We didn’t check. That’s our hunch based on validation results.