The Avito competition was about predicting illicit content in classified ads. It amounted to classifying text in Russian. We offer a review of what worked for top ranked participants and some opinions about how Kaggle competitions differ from the industry reality.
Since this article has multiple Russian accents, let’s make it clear that we’d much prefer Russia be in peace instead of waging war in Ukraine. It’s a tall order, but maybe in time Russian people will get better leadership.
The training set had roughly 1.3 million records, each consisting of a title, description, some attributes (key:value pairs), category and subcategory assignment and a few numeric features, including price.
Our solution was pretty typical for largish scale text classification (and regression, for that matter): use a couple Python scripts to get data in and out, let Vowpal Wabbit do the rest.
The evaluation metric was average precision, used for ranking. Avito best predictions benchmark was 0.927 and we have beaten it securely attaining 0.971, which would translate to roughly 25-th percentile on the final leaderboard. The winners got 0.987. The ROC AUC score was in similiar range. Note that these are pretty high numbers.
A broader view
We believe that when implementing machine learning for real, there’s a tradeoff between solution complexity and its performance in terms of score. In case of Avito, they would probably want top predictions flagged for moderator review, maybe deactivated automatically if the classifier is really sure. It probably doesn’t matter that much if the metric is 0.96 or 0.97 or 0.98 or 0.99. If there are really lots of ads to review, maybe it’s cheaper to hire another moderator than go for better scores.
On the other hand, for the heavyweights the leverage is so big that small improvements do make the difference, as noted in this talk: Sibyl: A System for Large Scale Machine Learning at Google.
The thing is, there’s a point of diminishing returns. Kaggle competitions are won by meticulously improving the score by a tiny fraction, usually by feature engineering, constructing different feature sets and ensembling the models. It would be rather hard to go this way in production.
One piece of evidence comes from the famous Netflix contest. People labored for three years to get the target score and most of the time was spent going that final mile. In the end it resulted in a great boost for matrix factorization and general machine learning research, but Netflix didn’t implement the winning solution:
We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.
Not everything goes as smoothly as production of goo in North Korea.
What worked this time
Let’s say you’re not in the industry, you just want to go to the top of the leaderboard. How do people do it, specifically? Usually at the end contenders post information about their approach, and sometimes their code, on the competition forum. It’s a fantastic way to learn, especially if you took part yourself.
Multiple feature sets
It seems that the winners really went far in preparing different feature sets and learning different models on each set, finally combining predictions to get a better score. Heed the confessions by barisumog (1st place, with Giulio):
- extract raw text from each post by concatenating the title, description and attributes sections (We tried many other features, some worked for Giulio, but none for me. I used only textual features)
- for each category and subcategory, create 3 tf-idf matrices: one with raw text, one with stemming, and one with stop words (separately, they gave similar results, but I noticed they improved the score a bit and became more stable when combined)
- for each category and subcategory, train 2 sets of SVCs with different C parameters on each tf-idf (again, similar results separately, but slightly better when combined)
- so now I have 2 x 3 SVCs for every category, and 2 x 3 SVCs for every subcategory (12 models to use for every data point)
And Mikhail Trofimov (2nd place):
Our approach is quite similar to discribed by Giulio. We use different pieces of data (title, title+description, title+description+attrs, title+attrs) and made 3 levels of details for each (top100k word, all word, all pair of words). For all of this features-sets was trained SVM, for some - additional LibFM models. Solely they give 0.97 - 0.983.
TF-IDF
Term frequency - inverse document frequency is a method for pre-processing text. Its goal is to put more weight to infrequent words, especially if they occur much in a given document:
The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
It payed to use it in this competition, unfortunately Vowpal Wabbit doesn’t have it implemented. Scikit-learn has a TfidfVectorizer.
Re-training a classifier on its own predictions
The winners employed a method they refer to as “semi-supervised learning”, probably for a lack of a better phrase. The idea is to train a model, get predictions for a test set, then use them as labels and re-train on train and test sets combined. The result is that the classifier gets more sure of its predictions, since it basically gets a positive feedback.
This technique has come up before in a competition. Yoshua Bengio said that It favors a low-density separation between classes, a commonly assumed prior for classification problems in machine learning. Here’s the relevant forum thread.
kNN
Curse of dimensionality? What curse of dimensionality? Apparently kNN with a big k and cosine similiarity gave good results. The downside: very slow. There are ways to speed it up, for example google-all-pairs-similarity-search.
Factorization machines
Michael Jahrer reports using libFM to get 0.98.
Separate models for each category
The ads fall into different categories and subcategories. It helped to treat them separately - we trained a model for each category. The model was basically the same as the general one, just trained on a subset of data. Our score went from 0.971 to 0.978.
Here’s how to get item counts for each category or category and subcategory in pandas:
data.pivot_table( columns = ['category'], values=['itemid'], aggfunc = 'count' )
data.pivot_table( columns = ['category', 'subcategory'], values=['itemid'], aggfunc = 'count' )
n-grams
We tried n-grams and while slightly better in validation, they didn’t improve the public score. Nonetheless people report using n-grams and quadratic features successfully.