Calibration applies when a classifier outputs probabilities. Some classifiers have characteristic quirks: boosted trees and SVMs, for example, are said to predict probabilities conservatively, pushing them closer to the mid-range than to the extremes. If your metric cares about exact probabilities, as logarithmic loss does, you can calibrate the classifier, that is, post-process its predictions to get better estimates.
This article was inspired by Andrew Tulloch’s post on Speeding up isotonic regression in scikit-learn by 5,000x.
Visualizing calibration with reliability diagrams
Before you attempt calibration, check how well-calibrated the classifier is to start with. The paper we’re going to refer to is Predicting good probabilities with supervised learning [PDF] by Niculescu-Mizil and Caruana.
On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc.
For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.
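The binning described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's actual plotting code; the `reliability_points` helper is a name made up for this example.

```python
import numpy as np

def reliability_points(p, y, n_bins=10):
    """Return (mean predicted value, fraction of positives) per non-empty bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # which bin each prediction falls into; clip so p == 1.0 lands in the last bin
    idx = np.clip(np.digitize(p, bins) - 1, 0, n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            points.append((p[mask].mean(), y[mask].mean()))
    return points
```

Plotting these points against the diagonal gives the reliability diagram: for a well-calibrated model, the mean predicted value in each bin is close to the observed fraction of positives.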
Here’s a reliability diagram of an almost perfectly calibrated classifier. It’s Vowpal Wabbit with data from the Criteo competition, if you’re curious.
x: mean predicted value for each bin, y: fraction of true positive cases.
And now a classifier that could use some calibration (Vowpal Wabbit / Avito competition):
Finally, let’s see a random forest trained on the Adult data:
It doesn’t look sigmoidal like the plots in the paper; rather, it looks like a sigmoid mirrored around the diagonal.
There are two popular calibration methods: Platt’s scaling and isotonic regression. Platt’s scaling amounts to training a logistic regression model on the classifier outputs. As Edward Raff writes:
You essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt’s case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.
We use an additional validation set for calibration: take the classifier’s predictions and the true labels and split them in two, then use the first part as a training set for the calibration model and the second part to evaluate the results.
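The split can be sketched with scikit-learn's `train_test_split`. Here `p` and `y` are synthetic stand-ins for the classifier outputs and true labels, and the 50/50 split and seed are illustrative choices, not the article's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

p = np.random.rand(1000)                     # stand-in for classifier predictions
y = (np.random.rand(1000) < p).astype(int)   # stand-in for true labels

# first half trains the calibrator, second half evaluates it
p_train, p_test, y_train, y_test = train_test_split(
    p, y, test_size=0.5, random_state=0)
```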
The code might look like the snippet below, where p_train and p_test are vectors of classifier outputs and y_train and y_test the corresponding true labels.
from sklearn.linear_model import LogisticRegression as LR

lr = LR()
lr.fit( p_train.reshape( -1, 1 ), y_train )	# LR needs X to be 2-dimensional
p_calibrated = lr.predict_proba( p_test.reshape( -1, 1 ))[:,1]
And now the Adult random forest calibrated with Platt’s scaling. The blue line shows “before” and the green line “after”. The plot looks smoother because we used fewer bins than in the diagram above.
Blue: before, green: after.
The numbers look good: AUC is unchanged and log loss reduction is dramatic.
accuracy - before/after: 0.847788697789 / 0.846805896806
AUC - before/after:      0.878139845077 / 0.878139845077
log loss - before/after: 0.630525772871 / 0.364873617584
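Numbers like those above can be computed with scikit-learn's metric functions. The arrays below are toy values invented for this sketch, not the article's Adult results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

p_test = np.array([0.2, 0.7, 0.4, 0.9])          # toy raw predictions
p_calibrated = np.array([0.1, 0.8, 0.3, 0.95])   # toy calibrated predictions
y_test = np.array([0, 1, 0, 1])

for name, p in (('before', p_test), ('after', p_calibrated)):
    acc = accuracy_score(y_test, p > 0.5)   # threshold at 0.5 for accuracy
    auc = roc_auc_score(y_test, p)
    ll = log_loss(y_test, p)
    print(f'{name}: accuracy {acc:.3f}, AUC {auc:.3f}, log loss {ll:.3f}')
```

Since Platt’s scaling applies a monotonic (sigmoid) transform, it preserves the ranking of predictions, which is why AUC stays exactly the same while log loss improves.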
The second popular method of calibrating is isotonic regression. The idea is to fit a piecewise-constant non-decreasing function instead of logistic regression. Piecewise-constant non-decreasing means stair-step shaped:
The stairs. Notice that this plot doesn’t deal with calibration. Credit: scikit-learn
The scikit-learn docs looked a bit confusing to us, but it’s just as simple as with logistic regression. There’s some talk about ordering, but you don’t need to sort either x or y; the algorithm takes care of that.
from sklearn.isotonic import IsotonicRegression as IR

ir = IR( out_of_bounds = 'clip' )
ir.fit( p_train, y_train )
p_calibrated = ir.transform( p_test )	# or ir.predict( p_test ), same thing
The Adult data again:
After calibration, accuracy and AUC suffer a tiny bit, and log loss gets smaller too, although nowhere near as much as with Platt’s scaling:
accuracy - before/after: 0.847788697789 / 0.845945945946
AUC - before/after:      0.878139845077 / 0.877184085166
log loss - before/after: 0.630525772871 / 0.592161024832
Remember the reliability diagrams for Vowpal Wabbit? Now the green line shows results after calibration by isotonic regression:
Let’s compare log loss scores:
In : ll( y_test, p_test )
Out: 0.45670528472608907

In : ll( y_test, p_test_calibrated )
Out: 0.45688394167069607
No improvement. And the second one:
It won’t come as a surprise that the score improved - the log loss dropped by 5.4%:
In : p_test_calibrated[np.isnan( p_test_calibrated )] = 1e-15	# replace NaNs so log loss is defined

In : log_loss( y_test, p_test )
Out: 0.040977954263511369

In : log_loss( y_test, p_test_calibrated )
Out: 0.038757356232921675
There’s no point in calibrating a classifier that is already well calibrated. First make a reliability diagram; if it looks like it could be improved, then calibrate, provided your metric justifies it.
The code is available at GitHub. You’ll need to modify load_data.py to suit your needs.
UPDATE: scikit-learn now has a good doc section on probability calibration.