Here’s the final version of the script, with two ideas that improve on our previous take: searching in product names and spelling correction.
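To make those two ideas concrete, here is a minimal sketch of both, using only the standard library and a made-up toy catalog. The product names, the `tokens` helper and the cutoff value are illustrative assumptions, not the actual script:

```python
import difflib
import re

# Toy catalog for illustration; in the contest the names come from the XML product data.
products = ["Call of Duty: Black Ops", "Batman: Arkham City", "Rocksmith"]

def tokens(s):
    """Lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", s.lower())

# Vocabulary of words seen in product names, used for spelling correction.
vocab = sorted({w for name in products for w in tokens(name)})

def correct(word, cutoff=0.8):
    """Snap a possibly misspelled word onto the closest vocabulary word."""
    hits = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return hits[0] if hits else word

def search(query):
    """Return products whose name contains every (corrected) query word."""
    words = [correct(w) for w in tokens(query)]
    return [p for p in products if all(w in tokens(p) for w in words)]

print(search("batmn arkam"))  # -> ['Batman: Arkham City']
```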
Best Buy mobile contest - big data
Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, versus the younger brother’s 20MB. This is reflected in the “cloud computing”, “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark.
But don’t be scared: most of the data mass is XML product information. The training and test sets together are only 378MB. Good news.
Best Buy mobile contest
There’s a contest on Kaggle called ACM Hackathon. Actually, there are two: one based on small data and one on big data. Here we will be talking about the small data contest - specifically, about beating the benchmark - but the ideas are equally applicable to both.
The deal is, we have a training set from Best Buy with search queries and the items users clicked after each query, plus some other data. The items in this case are Xbox games like “Batman”, “Rocksmith” or “Call of Duty”. We are asked to predict which item a user clicked given the query. The metric is MAP@5 (see an explanation of MAP).
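As a concrete, if naive, starting point, here is a sketch of a popularity baseline, assuming a train.csv with query and sku columns (check the actual header before relying on this): for each training query, recommend the five items most often clicked after it, padding with globally popular items for unseen queries.

```python
import csv
from collections import Counter, defaultdict

# Assumed file and column names -- verify against the actual training data.
clicks = defaultdict(Counter)          # query -> Counter of clicked items
with open("train.csv") as f:
    for row in csv.DictReader(f):
        clicks[row["query"]][row["sku"]] += 1

# Overall popularity, used as a fallback for queries not seen in training.
overall = Counter()
for c in clicks.values():
    overall.update(c)

def predict(query, k=5):
    """Top-k items clicked after this query, padded with globally popular items."""
    ranked = [sku for sku, _ in clicks[query].most_common(k)]
    for sku, _ in overall.most_common():
        if len(ranked) == k:
            break
        if sku not in ranked:
            ranked.append(sku)
    return ranked
```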
Running Unix apps on Windows
When it comes to machine learning, most software seems to be written in Python, Matlab or R, plus native apps, that is, compiled C/C++ - these are the fastest. Most of them are written for Unix environments such as Linux or Mac OS. So how do you run them on your computer if you have Windows installed?
Kaggle job recommendation challenge
This is an introduction to the Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering problem (with a lot of extra information), but not quite.
What you wanted to know about Mean Average Precision
Let’s say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items for each user, and they will evaluate the results using the mean average precision metric, or MAP. Specifically MAP@x, where x is the number of items to recommend per user. So what is this MAP?
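Before the full explanation, here is a small sketch that mirrors the commonly used definition of average precision at x: at every rank where a relevant item appears, take the precision so far, sum these, divide by min(x, number of relevant items), then average over users. The exact evaluation code a contest uses may differ in details.

```python
def apk(actual, predicted, k=5):
    """Average precision at k for one user."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:  # count each item only once
            hits += 1
            score += hits / (i + 1.0)               # precision at this rank
    return score / min(len(actual), k) if actual else 0.0

def mapk(actual_lists, predicted_lists, k=5):
    """Mean average precision at k over all users."""
    return sum(apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)) / len(actual_lists)

# One user liked items 1 and 3, and we recommended [1, 2, 3]:
print(mapk([[1, 3]], [[1, 2, 3]], k=5))  # (1/1 + 2/3) / 2 = 0.833...
```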