FastML

Machine learning made easy

Predicting closed questions on Stack Overflow

This time we enter the Stack Overflow challenge, which is about predicting the status of a given question on SO. There are five possible statuses, so it’s a multi-class classification problem.

We would prefer a tool able to perform multiclass classification by itself. It can be done by hand by constructing five datasets, each with binary labels (one class against all others), and then combining predictions, but it might be a bit tricky to get right - we tried. Fortunately, the nice people at Yahoo, excuse us, Microsoft, recently released a new version of Vowpal Wabbit, and this new version supports multiclass classification.
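The by-hand approach described above can be sketched in a few lines. This is only an illustration of the mechanics - the "binary model" below is a toy stand-in (distance to the positive-class centroid), where in the actual contest each one would be a real learner such as a VW logistic regression:

```python
# One-vs-all multiclass classification, done by hand:
# train one binary scorer per class, then combine by taking the
# most confident class. The scorer here is a toy centroid model.

def train_binary(X, y, positive_class):
    """Train a binary scorer: this class against all others."""
    pos = [x for x, label in zip(X, y) if label == positive_class]
    # Centroid of the positive examples (toy stand-in for a real learner).
    centroid = [sum(col) / len(pos) for col in zip(*pos)]
    def score(x):
        # Higher score = closer to the positive centroid.
        return -sum((a - b) ** 2 for a, b in zip(x, centroid))
    return score

def train_one_vs_all(X, y, classes):
    """One binary model per class."""
    return {c: train_binary(X, y, c) for c in classes}

def predict(models, x):
    # Combining predictions: pick the class whose binary model
    # is most confident about this example.
    return max(models, key=lambda c: models[c](x))

# Tiny synthetic dataset with three classes:
X = [[0, 0], [0, 1], [5, 5], [5, 6], [9, 0], [10, 1]]
y = [1, 1, 2, 2, 3, 3]
models = train_one_vs_all(X, y, classes=[1, 2, 3])
print(predict(models, [0.2, 0.5]))  # → 1
print(predict(models, [5, 5.5]))    # → 2
```

The tricky part alluded to above is the combining step: the five binary scores need to be on comparable scales for the argmax to mean anything, which is exactly where hand-rolled one-vs-all tends to go wrong.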

Best Buy mobile contest - big data

Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. the younger brother’s 20MB. This is reflected in the “cloud computing”, “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark.

But don’t be scared. Most of the data mass is in XML product information. The training and test sets together are 378MB. Good news.

Best Buy mobile contest

There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both.

The deal is, we have a training set from Best Buy with search queries and the items users clicked after each query, plus some other data. Items in this case are Xbox games like “Batman”, “Rocksmith” or “Call of Duty”. We are asked to predict the item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP).

Running Unix apps on Windows

When it comes to machine learning, most software seems to be written in either Python, Matlab or R. Plus native apps, that is, compiled C/C++ - these are the fastest. Most of them are written for Unix environments such as Linux or Mac OS. So how do you run them on your computer if you have Windows installed?

What you wanted to know about Mean Average Precision

Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (say, x of them) for each user, and will evaluate the results using the mean average precision (MAP) metric - specifically MAP@x, since we recommend x items per user. So what is this MAP?
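MAP@x fits in a few lines of code. The sketch below follows the formulation commonly used in Kaggle contests: for each user, average the precision at every rank where a relevant item appears (capping the denominator at x), then average that over users:

```python
def apk(actual, predicted, k=5):
    """Average precision at k for one user."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)   # precision at this cut-off
    return score / min(len(actual), k) if actual else 0.0

def mapk(actual_lists, predicted_lists, k=5):
    """Mean of the per-user average precisions."""
    return sum(apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)) \
        / len(actual_lists)

# One user clicked item "b"; we recommend five items, "b" in second place.
# Precision at the hit is 1/2, so AP@5 is 0.5:
print(apk(["b"], ["a", "b", "c", "d", "e"]))  # → 0.5
```

Note that order matters: recommending the right item first gives a full point, recommending it fifth gives only a fifth of one.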