Converting categorical data into numbers with Pandas and Scikit-learn

Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such tool but your data includes categorical features. To represent them as numbers typically one converts each categorical feature using “one-hot encoding”, that is from a value like “BMW” or “Mercedes” to a vector of zeros and one 1.

This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn.

Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, Google + StackOverflow combo works well as a source of answers.

UPDATE: Turns out that Pandas has get_dummies() function which does what we’re after. The following code will replace categorical columns with their one-hot representations:

cols_to_transform = [ 'a', 'list', 'of', 'categorical', 'column', 'names' ]
df_with_dummies = pd.get_dummies( columns = cols_to_transform )

This is the way we recommend now. (end update)

We’ll use Pandas to load the data, do some cleaning and send it to Scikit-learn’s DictVectorizer. OneHotEncoder is another option. The difference is as follows:

OneHotEncoder takes as input categorical values encoded as integers - you can get them from LabelEncoder.

DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys:

[ { 'foo': 1, 'bar': 'z' }, 
  { 'foo': 2, 'bar': 'a' },
  { 'foo': 3, 'bar': 'c' } ]

After vectorizing and saving as CSV it would look like this:

foo,bar=z,bar=a,bar=c
1,1,0,0
2,0,1,0
3,0,0,1

Notice the column names and that DictVectorizer doesn’t touch numeric values.

The representation above is redundant, because to encode three values you need two indicator columns. In general, one needs d - 1 columns for d values. This is not a big deal, but apparently some methods will complain about collinearity. The solution is to drop one of the columns. It won’t result in information loss, because in the redundant scheme with d columns one of the indicators must be non-zero, so if two out of three are zeros then the third must be 1. And if one among the two is positive than the third must be zero.

Pandas

Before

To convert some columns from a data frame to a list of dicts, we call df.to_dict( orient = 'records' ) (thanks to Jos P. Gonzlez-Brenes for the tip):

cols_to_retain = [ 'a', 'list', 'of', 'categorical', 'column', 'names' ]
cat_dict = df[ cols_to_retain ].to_dict( orient = 'records' )

If you have a few categorical columns, you can list them as above. In the Analytics Edge competition, there are about 100 categorical columns, so in this case it’s easier to drop columns which are not categorical:

cols_to_drop = [ 'UserID, 'YOB', 'votes', 'Happy' ]
cat_dict = df.drop( cols_to_drop, axis = 1 ).to_dict( orient = 'records' )

After

Using the vectorizer

from sklearn.feature_extraction import DictVectorizer as DV

vectorizer = DV( sparse = False )
vec_x_cat_train = vectorizer.fit_transform( x_cat_train )
vec_x_cat_test = vectorizer.transform( x_cat_test )

If the data has missing values, they will become NaNs in the resulting Numpy arrays. Therefore it’s advisable to fill them in with Pandas first:

cat_data = cat_data_with_missing_values.fillna( 'NA' )

This way, the vectorizer will create additional column <feature>=NA for each feature with NAs.

Handling binary features with missing values

If you have missing values in a binary feature, there’s an alternative representation:

-1 for negatives
0 for missing values
1 for positives

It worked better in case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778. That simple solution would give you 30th place out of 1686 contenders.

Code

We provide a sample script that loads data from CSV and vectorizes selected columns. Copy and paste the parts you find useful. If you’d like to run the script, you’ll need:

data from the Analytics Edge competition.
to split the training set into two files for validation, for example with split.py

split.py train.csv train_v.csv test_v.csv 0.8

Make sure to have headers in both files. With split.py, headers from the source file will end up in one of the output files, probably in train. Just copy the header line to the other file using a text editor or head -n 1 and cat on Unix.

FastML

Machine learning made easy

Converting categorical data into numbers with Pandas and Scikit-learn

Pandas

Using the vectorizer

Handling binary features with missing values

Code

Comments