FastML

Machine learning made easy

Vowpal Wabbit eats big data from the Criteo competition for breakfast

The Criteo competition is about ad click prediction. The unpacked training set is 11 GB and has 45 million examples. While we’re not sure if it qualifies as the mythical big data, it’s quite big by Kaggle standards.

Unless you have a machine with plenty of RAM, it will be difficult to process it in memory. Our solution is to use online or mini-batch learning, which processes either one example or a small batch of examples at a time. Vowpal Wabbit is especially well suited for the contest for a number of reasons.
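Vowpal Wabbit handles this out of the box. For illustration, here is a minimal Python sketch of the same out-of-core idea, reading a CSV in chunks with Pandas and updating a scikit-learn SGDClassifier one mini-batch at a time; the file name, the label column and the purely numeric features are assumptions made for the sketch, not the actual Criteo setup.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Online logistic regression: never load the whole training file into memory.
# 'train.csv' and the 'label' column are hypothetical.
model = SGDClassifier(loss='log_loss')

for chunk in pd.read_csv('train.csv', chunksize=100000):
    y = chunk['label'].values
    X = chunk.drop('label', axis=1).values
    # partial_fit updates the weights from one mini-batch at a time
    model.partial_fit(X, y, classes=[0, 1])
```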

Optimizing hyperparams with hyperopt

Very often the performance of your model depends on its parameter settings. It makes sense to search for optimal values automatically, especially if there’s more than one or two hyperparams, as is the case with extreme learning machines. Tuning ELM will serve as an example of using hyperopt, a convenient Python package by James Bergstra.
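As a taste of the mechanics, here’s a minimal sketch of hyperopt’s fmin running a TPE search over two hyperparameters. To keep it self-contained, the objective tunes a random-features pipeline (RBFSampler plus Ridge) on synthetic data rather than an actual ELM; the search space names and ranges are made up.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Search space: the bounds here are illustrative, not tuned.
space = {
    'n_components': hp.quniform('n_components', 50, 500, 50),
    'alpha': hp.loguniform('alpha', np.log(1e-4), np.log(10.0)),
}

def objective(params):
    model = make_pipeline(
        RBFSampler(n_components=int(params['n_components']), random_state=0),
        Ridge(alpha=params['alpha']),
    )
    score = cross_val_score(model, X, y, cv=3).mean()
    # hyperopt minimizes, so return the negative score as the loss
    return {'loss': -score, 'status': STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=30)
print(best)
```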

Extreme Learning Machines

What do you get when you take backpropagation out of a multilayer perceptron? You get an extreme learning machine, a non-linear model with the speed of a linear one.
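In essence: the hidden-layer weights are set at random and never trained, and only the linear output layer is fitted, typically by least squares. Here’s a rough numpy sketch of that recipe; the tanh activation and the interface are our own assumptions, not a reference implementation.

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine sketch: random hidden layer, least-squares output."""

    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.RandomState(seed)

    def _hidden(self, X):
        return np.tanh(X.dot(self.W) + self.b)

    def fit(self, X, y):
        # Hidden-layer weights are random and never updated -- no backpropagation.
        self.W = self.rng.randn(X.shape[1], self.n_hidden)
        self.b = self.rng.randn(self.n_hidden)
        H = self._hidden(X)
        # Only the output weights are fitted, by ordinary least squares.
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        return self

    def predict(self, X):
        return self._hidden(X).dot(self.beta)
```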

Converting categorical data into numbers with Pandas and Scikit-learn

Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros with a single one.

This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn.
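A small sketch of that pipeline, with a made-up toy DataFrame standing in for real data:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy data: one categorical column and one numeric column (invented for illustration).
df = pd.DataFrame({'make': ['BMW', 'Mercedes', 'BMW'], 'price': [35000, 42000, 30000]})

# DictVectorizer one-hot encodes string values and passes numeric values through.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(df.to_dict(orient='records'))

print(vectorizer.feature_names_)   # ['make=BMW', 'make=Mercedes', 'price']
print(X)
```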

Deep learning these days

It seems that quite a few people with an interest in deep learning think of it in terms of unsupervised pre-training, autoencoders, stacked RBMs and deep belief networks. It’s easy to get into this groove by watching one of Geoff Hinton’s videos from a few years ago, where he bashes backpropagation in favour of unsupervised methods that are able to discover the structure in data by themselves, the same way the human brain does. Those videos, papers and tutorials linger. They were state of the art once, but things have changed since then.

Exclusive Geoff Hinton interview

Geoff Hinton is a living legend. He is one of the inventors of backpropagation for training feed-forward neural networks. Despite being, in theory, universal function approximators, these networks turned out to be pretty much useless for more complex problems, like computer vision and speech recognition. Professor Hinton responded by creating deep networks and deep learning, an ultimate form of machine learning. Recently we’ve been fortunate to ask Geoff a few questions and have him answer them.