Machine learning made easy

Feature selection in practice

Lately we’ve been working with the Madelon dataset. It was originally prepared for a feature selection challenge, so while we’re at it, let’s select some features. Madelon has 500 attributes, 20 of which are real, the rest being noise. Hence the ideal scenario would be to select just those 20 features.

Fortunately we know just the right software for this task. It’s called mRMR, for minimum Redundancy Maximum Relevance, and is available in C and Matlab versions for various platforms. mRMR expects a CSV file with labels in the first column and feature names in the first row. So the game plan is:

  1. combine training and validation sets into a format expected by mRMR
  2. run selection
  3. filter the original datasets, discarding all features but the selected ones
  4. evaluate the results on the validation set
  5. if all goes well, prepare and submit files for the competition
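
Step 1 could look roughly like the sketch below. The helper name combine_for_mrmr and the toy data are made up for illustration; with the real data you would read the Madelon feature and label files (for example with read.table) instead. The column names y_combined and V1, V2, ... match the names that appear in the mRMR output later on.

```r
# Hypothetical helper: builds the table mRMR expects,
# with labels in the first column and feature names in the header row.
combine_for_mrmr <- function(x_train, y_train, x_val, y_val) {
  x <- rbind(x_train, x_val)
  colnames(x) <- paste0("V", seq_len(ncol(x)))
  data.frame(y_combined = c(y_train, y_val), x)
}

# Toy stand-in for the real data: 6 + 4 rows, 5 integer features.
set.seed(1)
x_train <- as.data.frame(matrix(sample(0:999, 30, replace = TRUE), nrow = 6))
x_val <- as.data.frame(matrix(sample(0:999, 20, replace = TRUE), nrow = 4))
y_train <- c(1, -1, 1, -1, 1, -1)
y_val <- c(-1, 1, 1, -1)

combined <- combine_for_mrmr(x_train, y_train, x_val, y_val)

# Written to a temporary directory here; the post uses data\combined_train_val.csv.
out_file <- file.path(tempdir(), "combined_train_val.csv")
write.csv(combined, out_file, row.names = FALSE, quote = FALSE)
```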

We’ll use R scripts for all the steps but feature selection. Now a few words about mRMR. It will show you possible options when run without parameters. Most of these options are self-explanatory. Two might need explanation: -m and -t. The first selects a method, and we stick with the default (MID in the output below). If you are interested, consult the paper.

The second one is a threshold for discretization. It has to do with the fact that mRMR needs to discretize feature values. With the threshold at zero (-t 0), it will just binarize: “above the mean” and “below the mean”. If you specify a non-zero threshold t, there will be three brackets, separated by two cut points:

  • the mean - t * standard deviation
  • the mean + t * standard deviation
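
To make the brackets concrete, here is a minimal sketch of the discretization as described above. It is not mRMR's actual code: the function name discretize, the use of the sample standard deviation, and the handling of values that fall exactly on a cut point are all assumptions.

```r
# Sketch of the thresholding described above (not mRMR's own code).
discretize <- function(x, t = 1) {
  m <- mean(x)
  s <- sd(x)  # assumption: sample standard deviation
  if (t == 0) {
    # two brackets: below the mean (-1) and above the mean (1)
    return(ifelse(x > m, 1L, -1L))
  }
  # three brackets: low (-1), middle (0), high (1)
  ifelse(x < m - t * s, -1L, ifelse(x > m + t * s, 1L, 0L))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
three_bins <- discretize(x, t = 1)  # cut points at about 2.26 and 7.74
two_bins <- discretize(x, t = 0)    # cut at the mean, 5
```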

A threshold of one seems to work well here. We’ll ask mRMR to select 20 features and use up to 10000 samples (or all available, if there are fewer):

mrmr -i data\combined_train_val.csv -n 20 -s 10000 -t 1

The output:

You have specified parameters: threshold=mu+/-1.00*sigma #fea=20 selection method=MID #maxVar=10000 #maxSample=10000

Target classification variable (#1 column in the input data) has name=y_combined
        entropy score=1.000

*** MaxRel features ***
Order    Fea     Name    Score
1        339     V339    0.031
2        242     V242    0.024
3        476     V476    0.023
4        337     V337    0.022
5        65      V65     0.020
6        473     V473    0.016
7        443     V443    0.015
8        129     V129    0.013
9        106     V106    0.010
10       49      V49     0.008
11       454     V454    0.007
12       494     V494    0.006
13       379     V379    0.005
14       5       V5      0.004
15       286     V286    0.004
16       411     V411    0.003
17       260     V260    0.003
18       287     V287    0.003
19       153     V153    0.003
20       324     V324    0.002

*** mRMR features ***
Order    Fea     Name    Score
1        339     V339    0.031
2        5       V5      0.004
3        49      V49     0.004
4        286     V286    0.002
5        287     V287    0.002
6        337     V337    0.003
7        411     V411    0.002
8        260     V260    0.002
9        153     V153    0.001
10       225     V225    0.001
11       497     V497    0.001

You’ll notice that there are two sets of features: MaxRel and mRMR. The first set takes only a short while to select, while the second needs time quadratic in the number of features selected, so with each additional feature you wait longer and longer. The upside is that mRMR usually produces slightly better results, but that’s not the case here, as you can see in the scores. We expect at least a few attributes with relatively high scores, and that’s what we get from MaxRel (see the chart), but not from mRMR.
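
The difference between the two rankings can be illustrated with a small sketch. It assumes already discretized features, scores relevance as mutual information with the label, and uses the MID criterion (relevance minus mean redundancy with the features picked so far). The names mutual_info and select_mid and the toy data are made up here, and the real program may differ in details such as tie handling.

```r
# Mutual information (in bits) between two discrete vectors.
mutual_info <- function(a, b) {
  joint <- unclass(table(a, b)) / length(a)
  p <- as.vector(joint)
  q <- as.vector(outer(rowSums(joint), colSums(joint)))
  nz <- p > 0
  sum(p[nz] * log2(p[nz] / q[nz]))
}

# Greedy selection with the MID criterion: relevance minus mean redundancy.
select_mid <- function(x, y, n) {
  relevance <- apply(x, 2, mutual_info, b = y)
  selected <- unname(which.max(relevance))
  while (length(selected) < n) {
    candidates <- setdiff(seq_len(ncol(x)), selected)
    score <- sapply(candidates, function(j) {
      redundancy <- mean(sapply(selected, function(s) mutual_info(x[, j], x[, s])))
      relevance[j] - redundancy
    })
    selected <- c(selected, candidates[which.max(score)])
  }
  selected
}

# Toy data: f2 duplicates f1, f3 is weakly relevant, f4 is noise.
f1 <- rep(c(0, 1), each = 8)
f2 <- f1
f3 <- rep(c(0, 1, 0, 1), each = 4)
f4 <- rep(c(0, 1), times = 8)
y <- c(0, 0, 0, 0, 0, 0, 1, 1, rep(1, 8))
x <- cbind(f1, f2, f3, f4)

relevance <- apply(x, 2, mutual_info, b = y)
maxrel <- order(relevance, decreasing = TRUE)  # ranks the duplicate pair first
mrmr <- select_mid(x, y, 2)                    # skips the duplicate
```

In this toy case MaxRel puts f1 and its copy f2 at the top, while the MID criterion penalizes f2 for its redundancy with f1 and picks f3 as the second feature.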

MaxRel feature scores

Where you cut off is a matter of some testing; we go with the top 13 MaxRel attributes. Their indexes start from one, which is consistent with R indexing, so now we just copy and paste the selected indexes into an R script and proceed:

mrmr_indexes = c( 339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379 )
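
Filtering then comes down to column indexing. A minimal sketch, assuming the features sit in data frames without the label column (so that index 339 is the column V339); the names train and val and the random data are placeholders for the real sets:

```r
mrmr_indexes <- c(339, 242, 476, 337, 65, 473, 443, 129, 106, 49, 454, 494, 379)

# Placeholder data with 500 feature columns named V1..V500.
set.seed(1)
train <- as.data.frame(matrix(rnorm(10 * 500), nrow = 10))
val <- as.data.frame(matrix(rnorm(4 * 500), nrow = 4))

# Keep only the selected features, in the order mRMR ranked them.
train_selected <- train[, mrmr_indexes]
val_selected <- val[, mrmr_indexes]
```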

Turns out that the process indeed improves the results: we get AUC = 0.96 on a test set, vs. 0.93 before. To obtain an even better score, we could write a few scripts and run Spearmint to optimize the threshold, the number of features used and the number of trees in a forest. Some other time, maybe. There are so many interesting things to try.
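
For reference, AUC can be computed in base R from the ranks of the predicted scores. This is one common way to do it (the rank-sum formula), not necessarily how the numbers above were obtained; the function name auc and the example values are made up.

```r
# AUC via the rank-sum (Mann-Whitney) formula.
# scores: predicted scores; labels: 1 for positive, anything else for negative.
auc <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties get average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

example_auc <- auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
```

In the example, three of the four positive/negative pairs are ordered correctly, which gives 0.75.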