Last time we talked about the small data branch of the Best Buy contest. Now it’s time to tackle the big boy. It is positioned as a “cloud computing sized problem”, because there is 7GB of unpacked data, vs. its younger brother’s 20MB. This is reflected in the “cloud computing”, “cluster” and “Oracle” talk in the forum, and also in the small number of participating teams: so far, only six contestants have managed to beat the benchmark.
But don’t be scared. Most of the data mass is in the XML product information. The training and test sets together are only 378MB. Good news.
The really interesting thing is that we can take the script we used for the small data and apply it to this contest, obtaining a score of 0.355 in a few minutes (the benchmark is 0.304). Not impressed? With a simple extension you can raise the score to 0.55. Read below for details.
This is the very same script, with one difference. In this challenge, the benchmark recommendations differ from line to line, so we can’t just hard-code five item IDs like before. Instead, we will read the benchmark file in parallel with the test file, so that when we need benchmark items, we have them handy:
train.py <train file> <test file> <benchmark file> <output file>
train.py train.csv test.csv popular_skus.csv predictions.txt
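To make the "read in parallel" idea concrete, here is a minimal sketch, not the actual train.py. It assumes the test file has a header with a column named query, and that the benchmark file has a header line followed by one line of space-separated SKUs per test row. The names fill_with_benchmark, predict and query_field are made up for illustration, and the toy data stands in for the real files.

```python
import csv
import io

def fill_with_benchmark(predicted, benchmark_items, n=5):
    """Return up to n unique items: our predictions first, benchmark items as filler."""
    result = []
    for item in list(predicted) + list(benchmark_items):
        if len(result) >= n:
            break
        if item not in result:
            result.append(item)
    return result

def predict(test_file, benchmark_file, mapping, query_field="query", n=5):
    """Walk the test file and the benchmark file line by line, in lockstep."""
    test_reader = csv.DictReader(test_file)      # assumed: header with a 'query' column
    benchmark_reader = csv.reader(benchmark_file)
    next(benchmark_reader)                       # assumed: benchmark file has a header line
    for test_row, benchmark_row in zip(test_reader, benchmark_reader):
        query = test_row[query_field].strip().lower()
        benchmark_items = benchmark_row[0].split()  # assumed: space-separated SKUs
        yield fill_with_benchmark(mapping.get(query, []), benchmark_items, n)

# Toy data standing in for test.csv and popular_skus.csv (hypothetical contents).
test_file = io.StringIO("user,query\nu1,Halo\nu2,unknown game\n")
benchmark_file = io.StringIO("sku\n10 20 30 40 50\n60 70 80 90 100\n")
mapping = {"halo": ["20", "99"]}  # query -> SKUs learned from the training set

predictions = list(predict(test_file, benchmark_file, mapping))
```

Because zip pulls one row from each reader at a time, neither file has to fit in memory, which is handy with test sets of this size.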
The main difference between the two contests, apart from data size, is that here we’re dealing with many product categories, not just Xbox games. The benchmark recommends the most popular products in a given category, not globally.
If we build our query -> product mapping taking categories into account, the score will go up dramatically, as promised. This is left as an exercise for the reader, as some of those academic types say. Have fun!
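If you want a nudge, here is one possible starting point, assuming each training row can be reduced to a (category, query, sku) triple. It is only a sketch of the idea of keying the mapping by category as well as query; build_mapping and top_items are hypothetical names, and the rows are invented.

```python
from collections import Counter, defaultdict

def normalize(query):
    """Very light query cleanup; the real script may do more."""
    return query.strip().lower()

def build_mapping(rows):
    """Count clicked SKUs per (category, query) pair instead of per query alone."""
    mapping = defaultdict(Counter)
    for category, query, sku in rows:
        mapping[(category, normalize(query))][sku] += 1
    return mapping

def top_items(mapping, category, query, n=5):
    """Most clicked SKUs for this query within this category, best first."""
    counts = mapping.get((category, normalize(query)))
    if not counts:
        return []  # nothing learned for this pair: fall back to benchmark items
    return [sku for sku, _ in counts.most_common(n)]

# Invented training rows: (category, query, sku).
rows = [
    ("games", "Halo", "111"),
    ("games", "halo ", "111"),
    ("games", "halo", "222"),
    ("movies", "halo", "333"),
]
mapping = build_mapping(rows)
games_result = top_items(mapping, "games", "HALO")
movies_result = top_items(mapping, "movies", "halo")
```

The same query now leads to different products depending on the category, and an empty result signals that the per-category benchmark items should be used instead.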