In this article, we revisit Numerai and their weekly data science tournament. New developments include a much larger dataset, tougher requirements for models, and bigger payouts.
It’s embarrassing, really
In August, we published the first version of goodbooks-10k, a new dataset for book recommendations. By pure chance, that coincided with the announcement of the Kaggle Datasets Awards. Oh, how we hoped to get one!
Introduction to pointer networks
Pointer networks are a variation of the sequence-to-sequence model with attention. Instead of translating one sequence into another, they yield a succession of pointers to the elements of the input series. The most basic use of this is ordering the elements of a variable-length sequence or set.
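To make the "pointer" idea concrete, here is a minimal NumPy sketch of a single decoding step, assuming additive attention of the form u_j = v^T tanh(W1 e_j + W2 d). The weights are random and untrained, and the names (pointer_distribution, W1, W2, v) are ours, chosen for illustration only. The point is that the softmax over attention scores is itself the output: a distribution over input positions.

```python
import numpy as np


def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return a probability distribution over input positions.

    encoder_states: (n_inputs, hidden), one vector per input element
    decoder_state:  (hidden,), the current decoder state
    """
    # Additive attention score for every input element
    scores = np.tanh(encoder_states @ W1.T + decoder_state @ W2.T) @ v
    # Numerically stable softmax; the result is the "pointer"
    scores = scores - scores.max()
    weights = np.exp(scores)
    return weights / weights.sum()


rng = np.random.default_rng(0)
n_inputs, hidden = 5, 8  # illustrative sizes

encoder_states = rng.normal(size=(n_inputs, hidden))
decoder_state = rng.normal(size=hidden)
W1 = rng.normal(size=(hidden, hidden))
W2 = rng.normal(size=(hidden, hidden))
v = rng.normal(size=hidden)

probs = pointer_distribution(encoder_states, decoder_state, W1, W2, v)
pointer = int(np.argmax(probs))  # index of the input element being pointed at
```

Because the output dimension equals the number of input elements, the same weights work for sequences of any length.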
Project RHUBARB: predicting mortality in England using air quality data
Once again we beat the benchmark in a Kaggle competition. The goal of the contest at hand was to forecast mortality rate in England using Copernicus Atmosphere Monitoring Service data on air quality. Specifically, to forecast mortality caused by cancer and cardiovascular diseases. The competition represents the “in class” category, because the data is publicly available somewhere on the internets. Still, the winner got a Raspberry Pi.
Tuning hyperparams fast with Hyperband
Hyperband is a relatively new method for tuning iterative algorithms. It samples configurations at random and attempts to gain an edge by allocating the time spent optimizing as efficiently as possible. We explain a few things that were not clear to us right away, and try the algorithm in practice.
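A rough sketch of the idea, following the bracket arithmetic as we understand it from the Hyperband paper: each bracket samples a batch of random configurations, evaluates them with a small budget, keeps the best 1/eta, and repeats with eta times the budget. The function names and the toy objective below are ours and purely illustrative; this is not a reference implementation.

```python
import math
import random


def hyperband(sample_config, evaluate, max_iter=81, eta=3, seed=0):
    """Return (best_loss, best_config) found across all brackets."""
    rng = random.Random(seed)

    # s_max = floor(log_eta(max_iter)), computed with integers
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1

    best_loss, best_config = math.inf, None

    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) * eta^s / (s + 1))
        n = -(-(s_max + 1) * eta ** s // (s + 1))
        # Initial budget (e.g. iterations) per configuration
        r = max_iter / eta ** s
        configs = [sample_config(rng) for _ in range(n)]

        # Successive halving within the bracket
        for i in range(s + 1):
            n_i = n // eta ** i
            r_i = r * eta ** i
            losses = [evaluate(c, r_i) for c in configs]
            ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
            if ranked[0][0] < best_loss:
                best_loss, best_config = ranked[0]
            # Keep roughly the best 1/eta for the next, longer round
            keep = max(1, n_i // eta)
            configs = [c for _, c in ranked[:keep]]

    return best_loss, best_config


# Toy problem (an assumption for illustration): the "hyperparameter" is a
# number in [0, 1], and the loss improves with budget as 1 / r.
def sample_config(rng):
    return rng.uniform(0.0, 1.0)


def evaluate(config, budget):
    return (config - 0.3) ** 2 + 1.0 / budget


best_loss, best_config = hyperband(sample_config, evaluate)
```

In real use, evaluate would train a model with the given configuration for the given number of iterations and return a validation loss.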
How to use pd.get_dummies() with the test set
It turns out that “Converting categorical data into numbers with Pandas and Scikit-learn” has become the most popular article on this site. Let’s revisit the topic and look at Pandas’ get_dummies() more closely.
Using the function is straightforward: you specify which columns you want encoded and get a dataframe with the original columns replaced by their one-hot encodings.
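A small example of the catch with the test set, using made-up data: encoding train and test separately can produce different columns when the two sets contain different category values. One common way to line them up, and not necessarily the one the full article settles on, is to reindex the test frame to the training columns.

```python
import pandas as pd

# Toy data (illustrative): "yellow" appears only in test,
# "red" and "blue" appear only in train.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["green", "yellow"], "size": [4, 5]})

# Encode the chosen column; it is replaced by one indicator column per value
train_enc = pd.get_dummies(train, columns=["color"], dtype=int)
test_enc = pd.get_dummies(test, columns=["color"], dtype=int)

# Align test with train: missing columns are added as zeros,
# columns unseen in training (color_yellow) are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After alignment both frames have the columns size, color_blue, color_green and color_red, so a model trained on train_enc can score test_enc directly.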
Data in, predictions out
With many implementations of machine learning algorithms it is entirely unclear how to train them on one’s own data and then how to get predictions. This is an area where AI researchers have a lot of catching up to do with lessons long ago learned in computer science.
On chatbots
Chatbots seem to be all the rage these days. Why don’t we take a look at this fascinating topic? A warning, though: this article contains strong opinions.
Piping in R and in Pandas
In the R community, there’s this one guy, Hadley Wickham, who by himself made R great again. One of the many, many things he came up with - so many they call it a hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute). This approach turned out to be successful. Since then, people have ported key pieces to Pandas.
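For a taste of the pipeline style in Python, here is a sketch that uses only what Pandas ships with, method chaining and DataFrame.pipe(), rather than any of the dplyr ports. The data and the helper add_doubled are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
})


def add_doubled(frame):
    """Hypothetical custom step: add a column with twice the length."""
    return frame.assign(doubled=frame["length"] * 2)


# Each step takes a data frame and returns a new one, dplyr-style
result = (
    df
    .query("length > 1")                     # filter rows
    .pipe(add_doubled)                       # plug in a custom function
    .groupby("species", as_index=False)      # group
    .agg(mean_doubled=("doubled", "mean"))   # summarise
)
```

Here result has one row per species, with mean_doubled equal to 6.0 for "a" and 8.0 for "b".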
Deep learning architecture diagrams
As a wild stream after a wet season in the African savanna diverges into many smaller streams forming lakes and puddles, so deep learning has diverged into a myriad of specialized architectures. Each architecture has a diagram. Here are some of them.