Sometimes, fortunately not very often, we see people mention Weka as useful machine learning software. This is misleading, because Weka is just a toy: it can give a beginner a taste of machine learning, but if you want to accomplish anything meaningful, there are far better tools.
As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to the ICML 2013 Workshop on Challenges in Representation Learning, organized by the deep learning guys from Montreal.
There are a couple of benchmarks for this competition, and the best one is unusually hard to beat: fewer than a fourth of the participants have managed to do so. We’re among them. Here’s how.
An interesting development occurred in the job salary prediction contest at Kaggle: the guy who ranked 3rd used logistic regression, in spite of the task being regression, not classification. We attempt to replicate the experiment.
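We don’t know the exact setup that contestant used, but one common way to apply logistic regression to a continuous target is to discretize it into bins and predict the bin. A minimal sketch of that idea, with made-up data standing in for the real job ads:

```python
# Sketch: turn salaries into quantile bins, fit multinomial logistic regression
# on the bins, then map predicted bins back to representative salary values.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))               # stand-in for job ad features
y = np.exp(rng.normal(10.5, 0.4, size=1000))  # stand-in for salaries

n_bins = 10
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
bins = np.digitize(y, edges[1:-1])                             # class labels 0..9
bin_means = np.array([y[bins == b].mean() for b in range(n_bins)])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, bins)

# predict a bin per example, then convert back to a salary estimate
y_pred = bin_means[clf.predict(X)]
print(np.abs(y_pred - y).mean())   # mean absolute error on the training data
```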
There’s a contest at Kaggle held by Qatar University. They want to be able to discriminate men from women based on handwriting. For a thousand bucks, well, why not?
Last time we explored dimensionality reduction in practice using Gensim’s LSI and LDA. Now, having spent some time researching the subject matter, we will give an overview of other options.
The job salary prediction contest at Kaggle offers a high-dimensional dataset: when you convert categorical values to binary features and text columns to a bag of words, you get roughly 240k features, a number very similar to the number of examples.
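For reference, here’s a rough sketch of where a matrix like that comes from, using scikit-learn; the column names and toy rows are invented, the real data has several categorical and text columns:

```python
# Categorical column -> one binary column per distinct value (DictVectorizer),
# text column -> one count column per token (CountVectorizer), stacked together.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'Category': ['IT Jobs', 'Teaching Jobs', 'IT Jobs'],
    'FullDescription': ['senior python developer',
                        'primary school teacher',
                        'java developer with sql'],
})

dv = DictVectorizer()
categorical = dv.fit_transform(df[['Category']].to_dict(orient='records'))

cv = CountVectorizer()
words = cv.fit_transform(df['FullDescription'])

X = hstack([categorical, words]).tocsr()
print(X.shape)   # on the real data the column count runs into the hundreds of thousands
```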
We present a way to select a few thousand relevant features using L1 (Lasso) regularization. A linear model seems to work just as well with those selected features as with the full set. This means we get roughly 40 times fewer features, for a much smaller, more manageable data set.
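A minimal sketch of the idea, not the exact code we used: fit an L1-penalized linear model so that most coefficients go to exactly zero, keep only the features with non-zero weights, and train an ordinary linear model on the reduced set. The data below is synthetic.

```python
# L1 step zeroes out most coefficients; SelectFromModel keeps the survivors.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5000))          # stand-in for the 240k-column matrix
y = X[:, :50] @ rng.normal(size=50) + rng.normal(scale=0.1, size=500)

selector = SelectFromModel(Lasso(alpha=0.05, max_iter=5000))
X_small = selector.fit_transform(X, y)
print(X_small.shape[1], 'features kept out of', X.shape[1])

# an ordinary linear model on the selected features only
model = Ridge().fit(X_small, y)
```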
To celebrate the first 100 followers on Twitter, we asked them what they would like to read about here. One of the responders, Itamar Berger, suggested a topic: how to choose a ML algorithm for a task at hand. Well, what do we know?
Much of the data in machine learning is sparse, that is, mostly zeros, and often binary. The phenomenon may result from converting categorical variables to one-hot vectors, and from converting text to bag-of-words representation. If each feature is binary - either zero or one - then it holds exactly one bit of information. Surely we could somehow compress such data to fewer real numbers.
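One simple way to do that - whether or not it’s the method we end up using - is a random projection: multiply the sparse binary matrix by a random matrix to get far fewer real-valued columns. A hedged sketch with made-up data:

```python
# 50k mostly-zero binary columns projected down to 500 real-valued components.
from scipy import sparse
from sklearn.random_projection import SparseRandomProjection

# random sparse matrix, values forced to 1.0 to make it binary
X = sparse.random(1000, 50000, density=0.001, format='csr', random_state=0)
X.data[:] = 1.0

projector = SparseRandomProjection(n_components=500, random_state=0)
X_small = projector.fit_transform(X)
print(X.shape, '->', X_small.shape)
```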
We’re back to Kaggle competitions. This time we will attempt to predict advertised salaries from job ads and of course beat the benchmark. The benchmark is, as usual, a random forest result. For starters, we’ll use a linear model without much preprocessing. Will it be enough?
Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods for supervised learning, you can get better results than from a random forest. And maybe even faster.
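A sketch of the combination, with the details (number of clusters, how the distances are used) left as assumptions: run K-means on the inputs, use the distances to the cluster centers as new features, and train a plain linear model on those.

```python
# Unsupervised K-means step produces a new representation, a supervised
# linear model is trained on top of it.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cluster the inputs, labels are not used here
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_train)

# new features: distance from each example to each cluster center
Z_train = km.transform(X_train)
Z_test = km.transform(X_test)

# plain linear model on the cluster-distance features
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print(clf.score(Z_test, y_test))
```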