Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features.
Fortunately we know just the right software for this task. It’s called mRMR, for minimum Redundancy Maximum Relevance, and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is:
- combine training and validation sets into a format expected by mRMR
- run selection
- filter the original datasets, discarding all features but the selected ones
- evaluate the results on the validation set
- if all goes well, prepare and submit files for the competition
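The combining step is a plain cbind/rbind affair. A minimal sketch, assuming the standard Madelon file names from the UCI repository (adjust the paths to your layout); the `y_combined` column name is what ends up as the label name in the mRMR output:

```r
# Combine the training and validation sets into the layout mRMR expects:
# labels in the first column, feature names in the first row.
combine_for_mrmr <- function(train_x, train_y, valid_x, valid_y) {
  cbind(y_combined = c(train_y, valid_y), rbind(train_x, valid_x))
}

# Assuming the standard Madelon file names (adjust paths as needed):
# train_x <- read.table("data/madelon_train.data")
# train_y <- read.table("data/madelon_train.labels")[, 1]
# valid_x <- read.table("data/madelon_valid.data")
# valid_y <- read.table("data/madelon_valid.labels")[, 1]
# combined <- combine_for_mrmr(train_x, train_y, valid_x, valid_y)
# write.csv(combined, "data/combined_train_val.csv", row.names = FALSE)
```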
We’ll use R scripts for all the steps except feature selection. Now a few words about mRMR. Run without parameters, it prints the available options. Most of these are self-explanatory; two might need a word of explanation.

The first selects the method, and we stick with the default. If you are interested in the alternatives, consult the paper.

The second, -t, is a threshold for discretization. It has to do with the fact that mRMR needs to discretize feature values. With the threshold at zero (-t 0), it just binarizes each feature: “above the mean” and “below the mean”. If you specify a non-zero threshold, there will be three brackets, marked by two cut points:
- the mean - t * standard deviation
- the mean + t * standard deviation
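In R terms, the discretization amounts to something like the sketch below (the exact handling of boundary values in the C code may differ):

```r
# Map a numeric feature to three states: below, inside, or above the
# [mean - t*sd, mean + t*sd] band. With t = 0 the band collapses to the
# mean, so the result is effectively binary.
discretize <- function(x, t = 1) {
  lo <- mean(x) - t * sd(x)
  hi <- mean(x) + t * sd(x)
  ifelse(x < lo, -1L, ifelse(x > hi, 1L, 0L))
}
```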
A threshold of one seems to work well here. We’ll ask it to select 20 features and to use up to 10000 samples (in effect, all available):
```
mrmr -i data\combined_train_val.csv -n 20 -s 10000 -t 1
```
```
You have specified parameters: threshold=mu+/-1.00*sigma #fea=20 selection method=MID #maxVar=10000 #maxSample=10000
Target classification variable (#1 column in the input data) has name=y_combined entropy score=1.000

*** MaxRel features ***
Order   Fea   Name   Score
1       339   V339   0.031
2       242   V242   0.024
3       476   V476   0.023
4       337   V337   0.022
5       65    V65    0.020
6       473   V473   0.016
7       443   V443   0.015
8       129   V129   0.013
9       106   V106   0.010
10      49    V49    0.008
11      454   V454   0.007
12      494   V494   0.006
13      379   V379   0.005
14      5     V5     0.004
15      286   V286   0.004
16      411   V411   0.003
17      260   V260   0.003
18      287   V287   0.003
19      153   V153   0.003
20      324   V324   0.002

*** mRMR features ***
Order   Fea   Name   Score
1       339   V339   0.031
2       5     V5     0.004
3       49    V49    0.004
4       286   V286   0.002
5       287   V287   0.002
6       337   V337   0.003
7       411   V411   0.002
8       260   V260   0.002
9       153   V153   0.001
10      225   V225   0.001
11      497   V497   0.001
^C
```
You’ll notice that there are two sets of features: MaxRel and mRMR. The first set takes only a moment to select, while the second needs time quadratic in the number of features, so each additional feature takes longer and longer to compute (hence the ^C in the output above). The upside is that mRMR usually produces slightly better results, but that’s not the case here, as you can see in the scores. We expect at least a few attributes with relatively high scores, and that’s what we get from MaxRel (the chart), but not from mRMR.
Where to cut off is a matter of some testing; we go with 13 attributes. The feature indexes start from one, which is consistent with R indexing, so we can just copy and paste the selected indexes into an R script and proceed:
```
mrmr_indexes = c( 339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379 )
```
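The filtering step is then plain column indexing. A sketch (`train_x` and `valid_x` are hypothetical names for whatever data frames you loaded the Madelon sets into):

```r
# Discard everything but the selected features; the same indexes apply
# to the training, validation and test sets alike.
mrmr_indexes <- c(339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379)
keep_selected <- function(x, idx) x[, idx, drop = FALSE]

# train_sel <- keep_selected(train_x, mrmr_indexes)
# valid_sel <- keep_selected(valid_x, mrmr_indexes)
```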
It turns out that the process indeed improves the results: we get AUC = 0.96 on the test set, vs. 0.93 before. To obtain an even better score, we could write a few scripts and run Spearmint to optimize the threshold, the number of features used and the number of trees in the forest. Some other time, maybe. There are so many interesting things to try.
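For the evaluation itself, AUC needs no extra packages: a rank-based version (equivalent to the Wilcoxon statistic) is a few lines of R. The model-fitting part below is only a commented-out sketch, since it assumes the randomForest package and hypothetical `train_sel`/`valid_sel` data frames holding the filtered train and validation sets:

```r
# Area under the ROC curve, computed as the fraction of
# (positive, negative) pairs that the scores order correctly,
# counting ties as half.
auc <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels != 1]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Assuming the randomForest package and filtered sets train_sel/valid_sel:
# library(randomForest)
# fit <- randomForest(train_sel, factor(train_y))
# p <- predict(fit, valid_sel, type = "prob")[, "1"]
# auc(p, valid_y)
```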