Sometimes, fortunately not very often, we see people mention Weka as useful machine learning software. This is misleading, because Weka is just a toy: it can give a beginner a taste of machine learning, but if you want to accomplish anything meaningful, there are far better tools.
Deep learning made easy
As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal.
There are a couple of benchmarks for this competition and the best one is unusually hard to beat: fewer than a fourth of those taking part managed to do so. We’re among them. Here’s how.
Regression as classification
An interesting development occurred in Job salary prediction at Kaggle: the guy who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment.
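One way to read "logistic regression for a regression task" is to bin the numeric target into classes, train a classifier on those classes, and map each predicted class back to a representative value. The sketch below shows only that binning round trip on made-up salaries. The bin width and the function names are assumptions for illustration, not the method used by the third-place finisher.

```python
# Hypothetical illustration: turn a numeric target into class labels and back.
BIN_WIDTH = 5000  # assumed bin width, chosen for illustration only


def salary_to_class(salary, bin_width=BIN_WIDTH):
    """Map a salary to the index of the bin that contains it."""
    return int(salary // bin_width)


def class_to_salary(label, bin_width=BIN_WIDTH):
    """Map a class label back to the midpoint of its bin."""
    return label * bin_width + bin_width / 2


salaries = [21000, 24999, 25000, 47500]  # made-up values
labels = [salary_to_class(s) for s in salaries]   # targets for a classifier
decoded = [class_to_salary(c) for c in labels]    # what a perfect classifier would give back
```

Even a perfect classifier loses up to half a bin width per example in this scheme, so the bin width trades off the number of classes against the achievable precision.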
Gender discrimination
There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not?
Dimensionality reduction for sparse binary data - an overview
Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options.
Large scale L1 feature selection with Vowpal Wabbit
The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples.
We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times fewer features and a much smaller, more manageable data set.
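To make the idea of L1 selection concrete, here is a minimal sketch in plain Python rather than Vowpal Wabbit. It fits a Lasso by coordinate descent on a tiny made-up data set and keeps the features whose weights are not exactly zero. The data, the penalty value and the function names are all assumptions for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1 one coordinate at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the residual that ignores j
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Made-up data: feature 0 matters, feature 1 is irrelevant, feature 2 is very weak.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.1, 2.9, -3.1, -2.9]

w = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]  # indices of surviving features
```

The L1 penalty zeroes out the irrelevant and the very weak feature, which is what makes it usable as a selection step before training a model on the reduced set.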
Choosing a machine learning algorithm
To celebrate the first 100 followers on Twitter, we asked them what they would like to read about here. One of the responders, Itamar Berger, suggested a topic: how to choose an ML algorithm for a task at hand. Well, what do we know?
Dimensionality reduction for sparse binary data
Much of the data in machine learning is sparse, that is, mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to a bag-of-words representation. If each feature is binary - either zero or one - then it holds at most one bit of information. Surely we could somehow compress such data to fewer real numbers.
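A small sketch of where the sparsity comes from, assuming a made-up categorical column: one-hot encoding a variable with k distinct values puts a single 1 in each row of k columns, so a fraction (k - 1) / k of the entries are zeros. The column contents and names below are illustrative only.

```python
colors = ["red", "green", "blue", "green", "red", "yellow"]  # made-up categorical column

vocabulary = sorted(set(colors))                   # one output column per distinct value
index = {value: i for i, value in enumerate(vocabulary)}


def one_hot(value):
    """Return a binary row with a single 1 in the column for this value."""
    row = [0] * len(vocabulary)
    row[index[value]] = 1
    return row


dense = [one_hot(c) for c in colors]

# Sparse form: store only the column index of the single 1 in each row.
sparse = [index[c] for c in colors]

n_entries = len(dense) * len(vocabulary)
zero_fraction = sum(v == 0 for row in dense for v in row) / n_entries
```

With four distinct values, three quarters of the dense matrix is zeros, and the share grows as the number of categories grows.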
Predicting advertised salaries
We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough?
The secret of the big guys
Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster.
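As a rough sketch of one way such a combination can work, assume K-means is used to learn centroids and each example is then re-encoded as a one-hot vector of its nearest centroid before a linear model is fit. The toy data below is made up so that no single threshold on the raw input separates the classes, while the cluster encoding makes them linearly separable. The function names, the fixed initial centroids and the learning rate are all assumptions for illustration.

```python
def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def kmeans_1d(xs, centroids, n_iters=10):
    """Plain K-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iters):
        groups = [[] for _ in centroids]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def encode(x, centroids):
    """One-hot vector marking the nearest centroid."""
    row = [0.0] * len(centroids)
    row[nearest(x, centroids)] = 1.0
    return row


def fit_linear(features, targets, lr=1.0, n_iters=100):
    """Least-squares linear model fit by full-batch gradient descent."""
    n, d = len(features), len(features[0])
    w = [0.0] * d
    for _ in range(n_iters):
        grad = [0.0] * d
        for row, t in zip(features, targets):
            err = sum(wj * fj for wj, fj in zip(w, row)) - t
            for j in range(d):
                grad[j] += err * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(x, centroids, w):
    score = sum(wj * fj for wj, fj in zip(w, encode(x, centroids)))
    return 1 if score > 0.5 else 0


# Made-up data: class 1 sits between two groups of class 0.
xs = [0.0, 1.0, 5.0, 6.0, 10.0, 11.0]
ys = [0, 0, 1, 1, 0, 0]

centroids = kmeans_1d(xs, centroids=[0.0, 5.0, 10.0])  # assumed deterministic start
w = fit_linear([encode(x, centroids) for x in xs], ys)
predictions = [predict(x, centroids, w) for x in [0.4, 5.2, 10.8]]
```

The linear model never sees the raw input here, only the cluster membership, which is why it can represent the 0, 1, 0 pattern along the line.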