FastML

Machine learning made easy

Math for machine learning

Sometimes people ask what math they need for machine learning. The answer depends on what you want to do, but in short, we think it is good to have some familiarity with linear algebra and multivariate differentiation.

Classifier calibration with Platt’s scaling and isotonic regression

Calibration applies when a classifier outputs probabilities. Some classifiers have typical quirks: for example, boosted trees and SVMs are said to predict probabilities conservatively, meaning closer to mid-range than to the extremes. If your metric cares about exact probabilities, as logarithmic loss does, you can calibrate the classifier, that is, post-process its predictions to get better estimates.
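To make the idea concrete, here is a minimal sketch of both methods using scikit-learn. The data is synthetic and entirely our assumption: we simulate a classifier whose predictions are squeezed toward 0.5, fit each calibrator on one half of the data, and compare log loss on the other half. Platt's scaling is approximated here by a logistic regression on the raw score, which may differ in detail from other implementations.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Synthetic stand-in for a classifier's held-out predictions (an assumption):
# labels are drawn from true_p, while the "classifier" reports conservative
# probabilities squeezed toward 0.5.
n = 5000
true_p = rng.uniform(0.02, 0.98, size=n)
y = (rng.uniform(size=n) < true_p).astype(int)
raw = 0.5 + 0.4 * (true_p - 0.5)

# Fit the calibrators on one half, evaluate on the other.
half = n // 2
raw_fit, y_fit = raw[:half], y[:half]
raw_eval, y_eval = raw[half:], y[half:]

# Platt's scaling: a one-feature logistic regression on the raw score.
# A large C makes the fit effectively unregularised.
platt = LogisticRegression(C=1e6)
platt.fit(raw_fit.reshape(-1, 1), y_fit)
p_platt = platt.predict_proba(raw_eval.reshape(-1, 1))[:, 1]

# Isotonic regression: a non-decreasing step function from score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_fit, y_fit)
# Clip away exact 0 and 1 so that log loss stays finite.
p_iso = np.clip(iso.predict(raw_eval), 1e-3, 1 - 1e-3)

loss_raw = log_loss(y_eval, raw_eval)
loss_platt = log_loss(y_eval, p_platt)
loss_iso = log_loss(y_eval, p_iso)
print(f"raw: {loss_raw:.4f}  platt: {loss_platt:.4f}  isotonic: {loss_iso:.4f}")
```

On this toy data both calibrated losses should come out lower than the raw one. With a real model, the calibrators need to be fitted on predictions for examples the classifier was not trained on.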

This article was inspired by Andrew Tulloch’s post on Speeding up isotonic regression in scikit-learn by 5,000x.

Vowpal Wabbit eats big data from the Criteo competition for breakfast

The Criteo competition is about ad click prediction. The unpacked training set is 11 GB and has 45 million examples. While we’re not sure if it qualifies as the mythical big data, it’s quite big by Kaggle standards.

Unless you have a machine with plenty of RAM, it will be difficult to process the set in memory. Our solution is to use online or mini-batch learning, which deals with either one example or a small batch of examples at a time. Vowpal Wabbit is especially well suited for the contest for a number of reasons.
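To illustrate what learning one example at a time looks like, here is a small pure-Python sketch of online logistic regression with hashed features. It is not Vowpal Wabbit's implementation, only the general idea; the feature names, the toy click stream and the learning rate are all assumptions made for the example.

```python
import math
import random
import zlib

N_BITS = 18
DIM = 2 ** N_BITS  # size of the hashed weight table


def hash_features(row):
    """Map each 'name=value' pair to a slot in the weight table."""
    return [zlib.crc32(f"{k}={v}".encode()) % DIM for k, v in row.items()]


def predict(weights, idx):
    """Logistic prediction from the weights of the active features."""
    z = sum(weights[i] for i in idx)
    z = max(min(z, 35.0), -35.0)  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_online(stream, lr=0.1):
    """One pass of SGD: each example is seen once and then discarded."""
    weights = [0.0] * DIM
    for row, label in stream:
        idx = hash_features(row)
        gradient = predict(weights, idx) - label
        for i in idx:
            weights[i] -= lr * gradient
    return weights


def toy_stream(n, seed=0):
    """Hypothetical click log: 'shop' pages get clicked far more often."""
    rng = random.Random(seed)
    for _ in range(n):
        site = rng.choice(["news", "shop", "blog"])
        device = rng.choice(["mobile", "desktop"])
        p_click = 0.8 if site == "shop" else 0.1
        yield {"site": site, "device": device}, int(rng.random() < p_click)


weights = train_online(toy_stream(20000))
p_shop = predict(weights, hash_features({"site": "shop", "device": "mobile"}))
p_news = predict(weights, hash_features({"site": "news", "device": "mobile"}))
print(f"p(click | shop) = {p_shop:.3f}, p(click | news) = {p_news:.3f}")
```

Because the stream is a generator, memory use depends on the size of the weight table rather than on the number of examples, which is what makes this style of learning practical for files that do not fit in RAM.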

Optimizing hyperparams with hyperopt

Very often the performance of your model depends on its hyperparameter settings. It makes sense to search for optimal values automatically, especially if there are more than one or two hyperparams, as is the case with extreme learning machines. Tuning ELM will serve as an example of using hyperopt, a convenient Python package by James Bergstra.

Extreme Learning Machines

What do you get when you take backpropagation out of a multilayer perceptron? You get an extreme learning machine, a non-linear model with the speed of a linear one.
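The idea fits in a few lines of NumPy. In this sketch the hidden layer weights are drawn at random and never trained, and only the output weights are fitted, here with ridge-regularised least squares. The tanh activation, the ridge term, the layer size and the toy sine data are all assumptions for illustration.

```python
import numpy as np


def elm_fit(X, y, n_hidden=50, alpha=1e-3, seed=0):
    """Fit an ELM: random fixed hidden layer, linear solve for the output layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never updated
    b = rng.normal(size=n_hidden)                # random biases, never updated
    H = np.tanh(X @ W + b)                       # hidden layer activations
    # Ridge solution for the output weights: (H'H + alpha*I) beta = H'y
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(n_hidden), H.T @ y)
    return W, b, beta


def elm_predict(X, model):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


# Toy regression problem: y = sin(3x), which a linear model cannot fit well.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0])
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

model = elm_fit(X_train, y_train)
mse_elm = float(np.mean((elm_predict(X_test, model) - y_test) ** 2))

# Plain linear regression on the same data, for comparison.
A_train = np.c_[X_train, np.ones(len(X_train))]
A_test = np.c_[X_test, np.ones(len(X_test))]
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
mse_linear = float(np.mean((A_test @ coef - y_test) ** 2))

print(f"ELM MSE: {mse_elm:.5f}, linear MSE: {mse_linear:.5f}")
```

Training amounts to one matrix product and one linear solve, which is where the speed of a linear model comes from.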

Converting categorical data into numbers with Pandas and Scikit-learn

Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros and a single 1.

This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn.
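A minimal sketch of that workflow is below. The small car table is made up for illustration, and in practice the frame would come from something like `pd.read_csv`. DictVectorizer expects a list of dicts, so we convert the frame to records first. String values are one-hot encoded and numeric values are passed through unchanged.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up data for illustration.
df = pd.DataFrame({
    "make": ["BMW", "Mercedes", "BMW", "Audi"],
    "fuel": ["petrol", "diesel", "diesel", "petrol"],
    "mileage": [42000, 15000, 88000, 61000],
})

# One dict per row, e.g. {"make": "BMW", "fuel": "petrol", "mileage": 42000}
records = df.to_dict(orient="records")

# sparse=False returns a dense NumPy array, which is easier to inspect.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)
print(X)
```

Each categorical column becomes one output column per distinct value, named like `make=BMW`. For data with many distinct values it is usually better to keep the default sparse output.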