This article is not about machine learning, but about a piece of computer science that often comes handy in data science practice. When writing code, everybody gets errors. Sometimes it is difficult to debug them. Using a debugger may help, but can also be intimidating. This is a TLDR tutorial in using pdb in IPython, focused on looking at variables inside functions.
We present a novel method for feature transformation, akin to standardization. The method comes from Michael Jahrer, who recently has won another competition and afterwards shared the approach he used.
Overfitting is on of the primary problems, if not THE primary problem in machine learning. There are many aspects to it, but in a general sense, overfitting means that estimates of performance on unseen test examples are overly optimistic. That is, a model generalizes worse then expected.
We explain two common cases of overfitting: including information from a test set in training, and the more insidious form: overusing a validation set.
There have been a few recommendations datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now.
In this article, we revisit Numerai and their weekly data science tournament. New developments include a much larger dataset, tougher requirements for models, and bigger payouts.
In August, we published the first version of goodbooks-10k, a new dataset for book recommendations. By pure chance, that coincided with a proclamation of Kaggle Datasets Awards. Oh, how we hoped to get one!
Pointer networks are a variation of the sequence-to-sequence model with attention. Instead of translating one sequence into another, they yield a succession of pointers to the elements of the input series. The most basic use of this is ordering the elements of a variable-length sequence or set.
Once again we beat the benchmark in a Kaggle competition. The goal of the contest at hand was to forecast mortality rate in England using Copernicus Atmosphere Monitoring Service data on air quality. Specifically, to forecast mortality caused by cancer and cardiovascular diseases. The competition represents the “in class” category, because the data is publicly available somewhere on the internets. Still, the winner got a Raspberry Pi.
Hyperband is a relatively new method for tuning iterative algorithms. It performs random sampling and attempts to gain an edge by using time spent optimizing in the best way. We explain a few things that were not clear to us right away, and try the algorithm in practice.
It turns out that Converting categorical data into numbers with Pandas and Scikit-learn has become the most popular article on this site. Let’s revisit the topic and look at Pandas’ get_dummies() more closely.
Using the function is straightforward - you specify which columns you want encoded and get a dataframe with original columns replaced with one-hot encodings.