FastML

Machine learning made easy

It's embarrassing, really

In August, we published the first version of goodbooks-10k, a new dataset for book recommendations. By pure chance, that coincided with the proclamation of the Kaggle Datasets Awards. Oh, how we hoped to get one!

The prize announcement filled us with great sadness, as the Kaggle team chose datasets we would never have suspected of having a chance. To be clear, there’s nothing wrong with them. You could just wonder whether there weren’t any better choices among over 350 alternatives.

Kaggle can give their money to whomever they desire, so let’s familiarize ourselves with the criteria they declared:

Quality, impact, and reach.

The actual winners

The dataset that grasped the jury’s attention the most comes from a simulation of a robot holding a ball. This is certainly groundbreaking work, as far as simulated robot arm kinematics are concerned [1] [2] [3]. Moreover, according to the justification, the dataset combines two exciting fields of research: robotics and deep learning. It does so in 20 numerical columns. Any child will tell you that any data with such characteristics positively calls for deep learning. Gradient boosting just wouldn’t work here.

Markers of engagement? Yuuuge: three likes (after the announcement) and 24 downloads. How’s that for impact?

The second dataset is Cryptocurrency Historical Prices, because the best we can hope for here is a dataset updated by hand once a week (for now, anyway). Not that there are any sources with current data available for instant download in a variety of formats, for Bitcoin [1] [2] [3], Ethereum [1] [2] [3], and any other God-forsaken cryptocurrency the Chinese entrepreneurs are warming the air with their ASICs for. And clearly, nothing of that nature has ever existed on Kaggle before [1] [2] [3] [4].

To be fair, though, this one got relatively many upvotes and downloads, as opposed to number one and three. At least some people liked it.

The third choice is perhaps the most curious for us. It’s a collection of favicons, tiny images that browsers use to represent websites in tabs, in the URL bar, and in bookmarks. 778 MB of favicons. What does THAT have to do with data science? If this question leaves you scratching your head, well, apparently there are lots of opportunities to explore image processing and computer vision techniques in Kernels with this dataset. Yep, absolutely. The era of research on 32x32 images is just around the corner.

As you would expect, the community is in a feeding frenzy over this one, as indicated by 6 likes total.

Kaggle Dataset Award Winners

There’s more

Twisting the knife in the wound, the scoring team also chose runners-up: two riveting - well, not really - tweet archives concerning US domestic issues, and one offshore dataset which managed to spark even less interest: about people who haven’t turned up at job interviews in India, if you must know.

In contrast, these seemed to us like potential competitors to goodbooks-10k, considering originality, downloads and likes: [1] [2] [3]. Not to mention SURECOMMENDER’s antics [1] [2], which have been quite enjoyable to watch.

To further add insult to injury: goodbooks-10k didn’t even merit a mention, but the next day a notebook and recommender built on the dataset were chosen to receive a weekly kernel award. Go figure.

Let’s conclude with some good news: the dataset prizes, originally advertised as a one-time event, will be awarded monthly from now on, through the end of the year! After this exciting first round we are definitely keen to partake, as we are sure many smart people are.
