Machine learning made easy

Kaggle job recommendation challenge

This is an introduction to the Kaggle job recommendation challenge. It looks a lot like a typical collaborative filtering problem (with a lot of extra information), but not quite. Spot these two big differences:

  • There are no explicit ratings. Instead, there’s info about which jobs a user applied to. This is known as one-class collaborative filtering (OCCF), or learning from positive-only feedback.

    If you want to dig deeper into the subject, there have already been contests with positive-only feedback, for example track two of the Yahoo KDD Cup or the Million Song Dataset Challenge at Kaggle (both about songs).

  • The second difference is less apparent. When you look at test users (that is, the users that we are asked to recommend jobs for), only about half of them made at least one application. For the other half there is no data, and so no collaborative filtering.

For the users who did apply, the application data is very sparse, so we would like to use CF, because it does well in such settings. To address the issues above, we need software for OCCF and a way to handle the remaining users.

As far as software goes, we can use MyMediaLite. It has an item recommendation tool that fits our needs exactly, has some nice algorithms and is very convenient: it takes a list of user/item IDs and produces recommendations for the users we are interested in. Some alternatives exist, for example GraphLab’s pmf can handle implicit feedback.
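To make the "list of user/item IDs" concrete, here is a minimal sketch of preparing such a list from application records. The function name, the toy IDs and the tab-separated, one-pair-per-line layout are assumptions for illustration; check the MyMediaLite documentation for the exact input format it expects.

```python
import csv
import io


def write_pairs(applications, out):
    """Write unique (user, job) pairs to `out`, one tab-separated pair per line.

    `applications` is any iterable of (user_id, job_id) tuples.
    Returns the number of unique pairs written.
    """
    seen = set()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for user_id, job_id in applications:
        pair = (user_id, job_id)
        if pair in seen:
            continue  # positive-only feedback: a repeated pair adds nothing
        seen.add(pair)
        writer.writerow(pair)
    return len(seen)


# Toy data (made-up IDs); in practice these would be read from the
# competition's applications file for one window.
applications = [(47, 169528), (47, 284009), (72, 2121), (47, 169528)]

buf = io.StringIO()
n_pairs = write_pairs(applications, buf)
print(buf.getvalue(), end="")
```

Note that there is no rating column: an application is simply present or absent, which is what makes this a one-class problem.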

Recommendations for users who didn’t make any applications? Easy as pie. We will just take them from the benchmark!
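The fallback can be sketched in a few lines. This is only an illustration under stated assumptions: `cf_recs` and `benchmark_recs` are hypothetical dictionaries mapping a user ID to a list of job IDs, one filled from the CF output and one from the benchmark file.

```python
def combine_recommendations(test_users, cf_recs, benchmark_recs):
    """Use CF recommendations where available, benchmark ones otherwise."""
    combined = {}
    for user in test_users:
        recs = cf_recs.get(user)
        if not recs:
            # No applications, so no CF output: fall back to the benchmark.
            recs = benchmark_recs.get(user, [])
        combined[user] = list(recs)
    return combined


# Toy example with made-up IDs: user 1 has CF recommendations, user 2 does not.
test_users = [1, 2]
cf_recs = {1: [10, 11, 12]}
benchmark_recs = {1: [20, 21], 2: [30, 31]}

final = combine_recommendations(test_users, cf_recs, benchmark_recs)
print(final)
```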

So, if you lacked inspiration, there you have it. Take these ideas and go beat the benchmark.

Finally, a note about data sparsity. The data consists of seven “windows”. We will treat them separately, because each user and each job is assigned to exactly one window, so there is no overlap between windows.

Let’s take applications from window two. There are about 50k unique users and 50k unique jobs, and only 200k examples, that is user/job pairs.

That means the data is very sparse: its density is roughly 0.01%. This is a bit small even for CF. Compare with the Netflix challenge, which had more users, but also way more ratings (100M), so the density was about 1%, 100 times bigger than here. So, the question is: what to do about it?
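The arithmetic behind these figures is just the number of observed pairs divided by the number of possible pairs. A quick sketch, using the rounded window-two counts from above; the Netflix user and movie counts (roughly 480k and 17.7k) are commonly cited figures and not taken from this post.

```python
def density(n_users, n_items, n_pairs):
    """Fraction of the user x item matrix that is actually observed."""
    return n_pairs / (n_users * n_items)


# Window two: about 50k users, 50k jobs, 200k applications.
window = density(50_000, 50_000, 200_000)

# Netflix prize: about 100M ratings; user and movie counts are an
# assumption based on commonly cited figures.
netflix = density(480_000, 17_770, 100_000_000)

ratio = netflix / window

print(f"window two: {window:.4%}")
print(f"netflix:    {netflix:.2%}")
print(f"netflix is roughly {ratio:.0f} times denser")
```

With these rounded inputs the window-two density comes out a bit under 0.01% and the gap to Netflix at around two orders of magnitude, in line with the rough numbers quoted above.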