Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features.
Fortunately we know just the right software for this task. It’s called mRMR, for minimum Redundancy Maximum Relevance, and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is:
- combine training and validation sets into a format expected by mRMR
- run selection
- filter the original datasets, discarding all features but the selected ones
- evaluate the results on the validation set
- if all goes well, prepare and submit files for the competition
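We’ll use R scripts for all the steps but feature selection. The combining step, for instance, might look roughly like this sketch (the file names and layout of the Madelon download are assumptions, so adjust to your copies; R’s default V1..V500 column names conveniently double as feature names):

# combine Madelon train and validation sets into the format mRMR expects:
# labels in the first column (named y_combined), feature names in the first row
train_x = read.table( 'data/madelon_train.data' )
train_y = read.table( 'data/madelon_train.labels' )[,1]
valid_x = read.table( 'data/madelon_valid.data' )
valid_y = read.table( 'data/madelon_valid.labels' )[,1]

combined = cbind( y_combined = c( train_y, valid_y ), rbind( train_x, valid_x ))
write.csv( combined, 'data/combined_train_val.csv', row.names = FALSE )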
Now a few words about mRMR. It will show you the possible options when run without parameters. Most of them are self-explanatory; two might need explanation: -m and -t. The first selects a method, and we stick with the default. If you are interested, consult the paper.
The second is a threshold for discretization: mRMR needs to discretize feature values before scoring them. With the threshold at zero (-t 0), it will just binarize: “above the mean” and “below the mean”. If you specify a threshold, there will be three brackets, marked by two cut points:
- the mean - t * standard deviation
- the mean + t * standard deviation
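In R terms, the discretization might look like this (our reconstruction, not mRMR’s actual code):

# assign each value to one of three brackets:
# below the lower cut point, between the points, or above the upper one
discretize = function( x, t ) {
    mu = mean( x )
    sigma = sd( x )
    ifelse( x < mu - t * sigma, -1, ifelse( x > mu + t * sigma, 1, 0 ))
}

With t = 0 the middle bracket collapses to the mean itself, which gives the binarization described above.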
A threshold of one seems to work well here. We’ll ask it to select 20 features and use 10000 samples (or all available):
mrmr -i data\combined_train_val.csv -n 20 -s 10000 -t 1
The output:
You have specified parameters: threshold=mu+/-1.00*sigma #fea=20 selection method=MID #maxVar=10000 #maxSample=10000
Target classification variable (#1 column in the input data) has name=y_combined
entropy score=1.000
*** MaxRel features ***
Order Fea Name Score
1 339 V339 0.031
2 242 V242 0.024
3 476 V476 0.023
4 337 V337 0.022
5 65 V65 0.020
6 473 V473 0.016
7 443 V443 0.015
8 129 V129 0.013
9 106 V106 0.010
10 49 V49 0.008
11 454 V454 0.007
12 494 V494 0.006
13 379 V379 0.005
14 5 V5 0.004
15 286 V286 0.004
16 411 V411 0.003
17 260 V260 0.003
18 287 V287 0.003
19 153 V153 0.003
20 324 V324 0.002
*** mRMR features ***
Order Fea Name Score
1 339 V339 0.031
2 5 V5 0.004
3 49 V49 0.004
4 286 V286 0.002
5 287 V287 0.002
6 337 V337 0.003
7 411 V411 0.002
8 260 V260 0.002
9 153 V153 0.001
10 225 V225 0.001
11 497 V497 0.001
^C
You’ll notice that there are two sets of features: MaxRel and mRMR. The first set takes only a short while to select, while the second needs time quadratic in the number of features, so each additional feature takes longer and longer to pick (hence the ^C in the output above). The upside is that it usually produces slightly better results, but that’s not the case here, as you can see from the scores. We’d expect at least a few attributes with relatively higher scores, and that’s what we get from MaxRel (the chart), but not from mRMR.
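The chart, by the way, is presumably just a plot of the MaxRel scores; you can reproduce something like it from the output above:

maxrel_scores = c( 0.031, 0.024, 0.023, 0.022, 0.020, 0.016, 0.015, 0.013,
    0.010, 0.008, 0.007, 0.006, 0.005, 0.004, 0.004, 0.003, 0.003, 0.003,
    0.003, 0.002 )
barplot( maxrel_scores, names.arg = 1:20, xlab = 'feature rank', ylab = 'MaxRel score' )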
Where to cut off is a matter of some testing; we go with 13 attributes. The indexes start from one, which is consistent with R indexing, so we just copy and paste the selected indexes into an R script and proceed:
mrmr_indexes = c( 339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379 )
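Filtering the original sets then comes down to column subsetting, reusing the variable names from the combining sketch above:

# keep only the selected features
train_filtered = train_x[, mrmr_indexes]
valid_filtered = valid_x[, mrmr_indexes]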
Turns out that the process indeed improves the results: we get AUC = 0.96 on the test set, vs. 0.93 before.
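For completeness, the evaluation step could look something like the sketch below; the random forest matches the trees mentioned in a moment, but the exact model and packages (randomForest, ROCR) are our assumptions:

library( randomForest )
library( ROCR )

# train on the filtered training set, score the held-out set
rf = randomForest( train_filtered, as.factor( train_y ), ntree = 500 )
p = predict( rf, valid_filtered, type = 'prob' )[,2]

# AUC of the predicted probabilities
pred = prediction( p, valid_y )
print( performance( pred, 'auc' )@y.values[[1]] )

To obtain an even better score, we could write a few scripts and run Spearmint to optimize the threshold, the number of features used and the number of trees in the forest. Some other time, maybe. There are so many interesting things to try.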