Let’s step back from forays into cutting-edge topics and look at the random forest, one of the most popular machine learning techniques today. Why is it so attractive?
Machine learning courses online
Madelon: Spearmint’s revenge
Little Spearmint couldn’t sleep that night. I was so close, he thought. It seemed that he had found a better-than-default value for one of the random forest hyperparams, but it turned out to be a false alarm. As he fell asleep, he made a decision: next time, I will show them!
Spearmint with a random forest
Now that we have the Spearmint basics nailed down, we’ll try tuning a random forest, specifically two hyperparams: the number of trees (ntrees) and the number of candidate features considered at each split (mtry). Here’s some code.
We’re going to use the red wine quality dataset. It has about 1600 examples, and our goal will be to predict a wine’s rating from all its other properties.
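To make the setup concrete, here is a minimal R sketch of the kind of model being tuned - not the actual Spearmint wrapper from the post. It assumes the randomForest package and a local winequality-red.csv in the usual semicolon-separated UCI format with a quality column; the hyperparam values are just placeholders for what Spearmint would propose.

```r
# A minimal sketch of the model being tuned - not the original Spearmint wrapper.
# Assumes the randomForest package and a local winequality-red.csv
# (semicolon-separated, with a "quality" column, as in the UCI distribution).
library(randomForest)

wine <- read.csv("winequality-red.csv", sep = ";")

set.seed(42)
train_idx <- sample(nrow(wine), round(0.8 * nrow(wine)))
train <- wine[train_idx, ]
test  <- wine[-train_idx, ]

# The two hyperparams Spearmint would search over: ntree and mtry
# (the randomForest package calls the first one ntree, without the "s").
rf <- randomForest(quality ~ ., data = train, ntree = 500, mtry = 4)

preds <- predict(rf, test)
print(mean((preds - test$quality)^2))   # validation MSE, the value to minimize
```

Spearmint’s job would then be to propose new (ntree, mtry) pairs and keep the ones that lower that validation error.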
Tuning hyperparams automatically with Spearmint
The promise
What’s attractive about machine learning? That a machine does the learning instead of a human. But an operator still has a lot of work to do. First, he has to learn how to teach a machine in general. Then, when it comes to a concrete task, there are two main areas where a human needs to do the work (and remember, laziness is a virtue, at least for a programmer, so we’d like to minimize the amount of work done by a human):
- data preparation
- model tuning
This story is about model tuning.
Predicting wine quality
This post is as much about wine as it is about machine learning, so if you enjoy wine, like we do, you may find it especially interesting. Here’s some R and Matlab code, and if you want to get right to the point, skip to the charts.
There’s a book by Philipp Janert called Data Analysis with Open Source Tools, which, by the way, we would recommend. From this book we found out about the wine quality datasets. There are two, one for red wine and one for white wine, and they are interesting because they contain quality ratings (1-10) for a few thousand wines, along with their physical and chemical properties. We could probably use these properties to predict a rating for a wine.
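As a rough illustration of what such a model might look like, here is a hedged R sketch of a linear-regression baseline on the red wine data; the download URL and the semicolon separator are assumptions based on the standard UCI distribution of the dataset.

```r
# A quick linear-regression baseline on the red wine data (illustrative only).
# The URL and separator are assumptions based on the usual UCI distribution.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
red <- read.csv(url, sep = ";")

fit <- lm(quality ~ ., data = red)
summary(fit)   # which physical and chemical properties move the rating?

preds <- predict(fit, red)
print(sqrt(mean((preds - red$quality)^2)))   # in-sample RMSE, a crude sanity check
```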
The Facebook challenge HOWTO
Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless. That’s because many names refer to the same entity - there are about 350k unique names, while the total number of entities is maybe 20k. So cleaning the data is the first and most important step.
So you want to work for Facebook
Good news, everyone! There’s a new contest on Kaggle - Facebook is looking for talent. They won’t pay, but they just might interview you.
This post is in a way a bonus for active readers, because most visitors to fastml.com originally come from the Kaggle forums. For this competition the forums are disabled to encourage own work. To honor this, we won’t publish any code. But own work doesn’t mean original work, and we wouldn’t want to reinvent the wheel, would we?
Merck challenge
Today it’s about the Merck challenge - let’s beat the benchmark real quick. Not by much, but quick.
Predicting closed questions on Stack Overflow
This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multiclass classification problem.
We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all the others), and then combining the predictions, but it might be a bit tricky to get right - we tried. Fortunately, the nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and it supports multiclass classification.
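To make the “by hand” route concrete, here is a hedged R sketch of one-against-all with logistic regression; the train and test data frames and their status column are made up for illustration, and this is not the code used for the actual entry.

```r
# One-against-all by hand, sketched with logistic regression (glm).
# Hypothetical data frames: train and test, with a column "status"
# holding the five classes and the remaining columns holding numeric features.
classes <- sort(unique(train$status))

# One binary model per class: this class against all the others.
models <- lapply(classes, function(cls) {
  d <- train
  d$target <- as.integer(d$status == cls)
  d$status <- NULL
  glm(target ~ ., data = d, family = binomial)
})

# Score every model on the test set and pick the most probable class per row.
probs <- sapply(models, function(m) predict(m, newdata = test, type = "response"))
predicted <- classes[max.col(probs)]
```

The new Vowpal Wabbit release spares us this bookkeeping: its one-against-all mode does the splitting and combining internally, which is why we reach for it in this competition.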