We have a few ideas about what to write about next and are looking for your feedback. Vote in the poll at the bottom of this post.
Many data science competitions suffer from the test set being markedly different from the training set (a violation of the “identically distributed” assumption), which makes it difficult to construct a representative validation set. We propose a method for selecting the training examples most similar to the test examples and using them as a validation set. The core of the idea is training a probabilistic classifier to distinguish train examples from test examples.
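A minimal sketch of the idea, assuming scikit-learn and synthetic data (all names and numbers here are illustrative, not the competition code): label every example by origin, fit a probabilistic classifier, and keep the most test-like training examples for validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic covariate shift: the test set's features are shifted
# relative to the training set's.
X_train = rng.normal(0.0, 1.0, size=(1000, 5))
X_test = rng.normal(0.5, 1.0, size=(500, 5))

# Label each example by origin: 0 = train, 1 = test.
X = np.vstack([X_train, X_test])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

# A probabilistic classifier learns to tell the two sets apart.
clf = LogisticRegression().fit(X, y)

# Score only the training examples: high p(test) means "looks like test".
p_test = clf.predict_proba(X_train)[:, 1]

# The most test-like training examples become the validation set.
n_val = 200
order = np.argsort(p_test)
val_idx, train_idx = order[-n_val:], order[:-n_val]
```

If the classifier can't beat chance, train and test are effectively indistinguishable and any random validation split is fine; the method matters exactly when it can.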
Bayesian machine learning
So you know Bayes’ rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - it certainly took us a while. This article is the introduction we wish we had back then.
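As a taste of the starting point, here is Bayes’ rule on a toy spam-filtering question (the numbers are made up for illustration): given a prior and two likelihoods, what is the probability an email is spam once we see the word “free”?

```python
# Bayes' rule: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.2                # prior P(spam), assumed
p_free_given_spam = 0.6     # likelihood P("free" | spam), assumed
p_free_given_ham = 0.05     # likelihood P("free" | not spam), assumed

# Evidence via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior: the word "free" raises P(spam) from 0.2 to 0.75.
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.75
```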
Challenges in deep learning
For us, there are two major challenges facing deep learning: computational demands and cognitive demands. By cognitive demands we mean that the stuff is getting complicated. We take a look at the situation and at how people go about dealing with the computational demands.
Conformal prediction for classification
Conformal prediction is related to classifier calibration. The basic premise is that you get a guaranteed maximum error rate (on false negatives, to be exact), and you can set that rate as low or as high as you’re willing to tolerate. The catch is that you may get multiple classes assigned to an example: in binary classification, a point can be labelled both positive and negative.
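To make the premise concrete, here is a rough split-conformal sketch on synthetic scores (our own illustration, not code from the upcoming article): a held-out calibration set fixes a score threshold at the tolerated error rate, and the prediction set for a new point is every label that clears it - possibly both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend we already have a binary classifier's p(y=1) outputs
# on a labelled calibration set (synthetic and deliberately noisy).
n_cal = 500
y_cal = rng.integers(0, 2, n_cal)
p_cal = np.clip(0.4 * y_cal + 0.6 * rng.random(n_cal), 0.01, 0.99)

alpha = 0.1  # the error rate you're willing to tolerate

# Nonconformity score: 1 - predicted probability of the true class.
scores = np.where(y_cal == 1, 1 - p_cal, p_cal)

# Conformal threshold: an adjusted (1 - alpha) quantile of the scores.
q = np.quantile(scores, (1 - alpha) * (n_cal + 1) / n_cal)

def prediction_set(p1):
    """Every label whose score clears the threshold: {0}, {1}, or both."""
    s = set()
    if 1 - p1 <= q:   # label 1 conforms
        s.add(1)
    if p1 <= q:       # label 0 conforms
        s.add(0)
    return s
```

Confident points get a single label, while ambiguous ones near p = 0.5 get both - that is the multiple-assignment catch mentioned above, and it is the price of the coverage guarantee.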
Handling big data files
The Genentech competition made available rather large data files containing the complete medical history of a few million patients. The three biggest were roughly 50 GB on disk and held about 500 million examples each. How do you handle such files, specifically how do you run GROUP BY operations on them? We considered two options: a relational database or Pandas. We went with Pandas. It didn’t quite work, even on a machine with enough RAM, but we found a way.
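One common way to run a GROUP BY on a file bigger than RAM - not necessarily the trick the article will describe - is to stream it in chunks with Pandas, aggregate each chunk, and fold the partial results together. A toy version, with an in-memory string standing in for the 50 GB file:

```python
import io
import pandas as pd

# Stand-in for a huge CSV; in reality you'd pass a file path
# and a much larger chunksize.
csv = io.StringIO(
    "patient_id,claims\n"
    "1,2\n2,5\n1,3\n3,1\n2,4\n"
)

# Only one chunk plus the running totals are ever in memory.
totals = None
for chunk in pd.read_csv(csv, chunksize=2):
    part = chunk.groupby("patient_id")["claims"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.astype(int).to_dict())  # {1: 5, 2: 9, 3: 1}
```

This works because sums (and counts, mins, maxes) combine across chunks; aggregates like medians don’t decompose this way and need a different approach.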
Now that you know the options, please cast your vote for what you would like to read about next.
UPDATE: Voters clearly seem to prefer an article about Bayesian machine learning, so it’s coming. Posts on the other subjects may appear too, possibly in shorter-than-usual form.