In this installment we will demonstrate how to turn text into numbers using a method known as bag of words. We will also show how to train a simple neural network on the resulting sparse data for binary classification. We achieve the first feat with Python and scikit-learn, the second with sparsenn. The example data comes from a Kaggle competition, specifically StumbleUpon Evergreen.
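To make the bag-of-words idea concrete, here is a minimal scikit-learn sketch; the toy documents are made up and CountVectorizer simply stands in for whatever vectorizer the post actually uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents standing in for the real text data
docs = [
    "cats and dogs and cats",
    "dogs chase cats",
]

vectorizer = CountVectorizer()       # bag of words: one column per distinct token
X = vectorizer.fit_transform(docs)   # a scipy.sparse matrix, shape (n_docs, n_tokens)

print(vectorizer.vocabulary_)        # token -> column index
print(X.toarray())                   # dense view, fine only for tiny examples
```

The output is sparse, which is exactly the kind of data the neural network part of the post deals with.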
Accelerometer Biometric Competition
Can you recognize users of mobile devices from accelerometer data? It’s a rather non-standard problem (we’re dealing with time series here) and an interesting one. So we wrote some code and ran it on EC2, then wrote more code and ran EC2 again. After much computation we had our predictions, submitted them and achieved AUC = 0.83. Yay! But there’s a twist to this story.
Running things on a GPU
You’ve heard about running things on a graphics card, but have you tried it? All you need to taste the speed is an Nvidia card and some software. We run our experiments in Python, using Cudamat and Theano.
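As a taste, here is a minimal Theano sketch (not taken from the post itself): compiled with something like THEANO_FLAGS=device=gpu,floatX=float32, the same code runs the matrix multiply on the card instead of the CPU.

```python
import numpy as np
import theano
import theano.tensor as T

# symbolic graph: multiply two matrices; Theano compiles it for CPU or GPU
a = T.matrix("a")
b = T.matrix("b")
matmul = theano.function([a, b], T.dot(a, b))

x = np.random.randn(2000, 2000).astype(theano.config.floatX)
y = np.random.randn(2000, 2000).astype(theano.config.floatX)

print(matmul(x, y).shape)   # (2000, 2000), computed on the GPU if one is configured
```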
Introducing phraug
Recently we proposed pre-processing large files line by line. Now it’s time to introduce phraug, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning.
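To give a flavour of what such a script does, here is a toy CSV-to-VW converter in the same spirit; it is not phraug’s actual code, and it assumes the label sits in the first column, which may not match your data.

```python
import csv
import sys

def csv2vw(in_path, out_path):
    """Stream a CSV file and write VW-format lines, one at a time."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        reader = csv.reader(fin)
        next(reader)  # skip the header row
        for row in reader:
            label = row[0]  # assumption: label in the first column
            features = " ".join(
                "f%d:%s" % (i, value)
                for i, value in enumerate(row[1:], start=1)
                if value
            )
            fout.write("%s | %s\n" % (label, features))

if __name__ == "__main__":
    csv2vw(sys.argv[1], sys.argv[2])
```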
Processing large files, line by line
Text files are perhaps the most common data format in machine learning. Often the data is too large to fit in memory; this is sometimes referred to as big data. But do you really need to load the whole thing into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code.
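The core pattern is short enough to show here; the file names and the lowercasing transform are just placeholders.

```python
def process(in_path, out_path, transform):
    # read one line at a time, so memory use stays constant regardless of file size
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(transform(line))

# example: lowercase every line of a (possibly huge) text file
process("train.csv", "train_lower.csv", str.lower)
```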
Go non-linear with Vowpal Wabbit
Vowpal Wabbit now supports a few modes of non-linear supervised learning. They are:
- a neural network with a single hidden layer
- automatic creation of polynomial features, specifically quadratic and cubic
- N-grams
We describe how to use them, providing examples from the Kaggle Amazon competition and from the kin8nm dataset.
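For orientation, here is a sketch of the relevant command-line switches, driven from Python; train.vw and the namespace letters a, b, c are placeholders, so adjust them to your data.

```python
import subprocess

# each entry maps a non-linear mode to a vw invocation (vw must be installed,
# and train.vw is a hypothetical dataset in VW format)
runs = {
    "neural network, 10 hidden units": "vw train.vw --nn 10 -f nn.model",
    "quadratic features (namespaces a, b)": "vw train.vw -q ab -f quadratic.model",
    "cubic features (namespaces a, b, c)": "vw train.vw --cubic abc -f cubic.model",
    "bigrams": "vw train.vw --ngram 2 -f ngram.model",
}

for name, cmd in runs.items():
    print(name)
    subprocess.run(cmd.split(), check=True)
```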
Amazon aspires to automate access control
This is about the Amazon access control challenge at Kaggle. Either we’re getting smarter, or the competition is easy. Or maybe both. You can beat the benchmark quite easily, and with an AUC of 0.875 you’d be comfortably in the top twenty percent at the moment. We placed fourth in our first attempt - the model was quick to develop and back then there were fewer competitors.
More on sparse filtering and the Black Box competition
The Black Box challenge has just ended. We were thoroughly thrilled to learn that the winner, doubleshot, used sparse filtering, apparently following our cue. His score in terms of accuracy is 0.702, ours 0.645, and the best benchmark 0.525.
We ranked 15th out of 217, a few places ahead of the Toronto team consisting of Charlie Tang and Nitish Srivastava. To his credit, Charlie won the two remaining Challenges in Representation Learning.
And deliver us from Weka
Sometimes, fortunately not very often, we see people mention Weka as useful machine learning software. This is misleading, because Weka is just a toy: it can give a beginner a bit of a taste of machine learning, but if you want to accomplish anything meaningful, there are many far better tools.
Deep learning made easy
As usual, there’s an interesting competition at Kaggle: the Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal.
There are a couple of benchmarks for this competition, and the best one is unusually hard to beat: fewer than a fourth of those taking part have managed to do so. We’re among them. Here’s how.