Data in, predictions out

With many implementations of machine learning algorithms it is entirely unclear how to train them on one’s own data and then how to get predictions. This is an area where AI researchers have a lot of catching up to do with lessons long ago learned in computer science.

Bad programmers worry about the code. Good programmers worry about data structures and their relationships. - Linus Torvalds (the creator of Linux)

Computer science is about dealing with complexity. The main means for this is abstraction, meaning building things from smaller blocks. The things themselves then become blocks for building even bigger things.

The basic and most important unit of abstraction is a function. A function is a black box that takes some inputs and returns some outputs. The whole idea is that you don’t need to know how things work inside, all you care about is the interface. Few people know how exactly a car works under the hood. Knowing how to drive it is enough.

The same goes to stuff in machine learning. From a user’s perspective, we would like to know how to format our data for training and then how to get predictions. It seems obvious, but unfortunately, practice shows otherwise. Time and time again we encounter implementations where data is just background for an algorithm.

We dedicated two articles to this very matter: Loading data in Torch (is a mess) and How to get predictions from Pylearn2. We like to think that ease of use, or lack of it, has something to do with long-term popularity, or lack of it, of a given software library.

As a more recent example, let’s look at Phased LSTM. The purpose of the model is to deal with asynchronous time series, where step size, or period between events, might differ. There are at least four implementations at Github, including the official one.

Naturally, since the point is to process irregularly sampled data, the first question would be how to represent such data. As an exercise, go figure this out.

Two of the implementations [1] [2] don’t bother with async inputs at all, they just use MNIST as an example of dealing with long time series - it’s the secondary usage scenario for the model.

The remaining two generate toy data on the fly, as is often the case with code accompanying a paper. In effect, there is no sample to look at. One needs to dig into the code, find the generator and run it to look at some data.

Tell me, Mr Anderson: what good is a program
if you’re unable to run it on your input?

Similarly, getting predictions from a model is often an afterthought. Some authors are content to compute a bunch of metrics and leave it at that. Why would anyone ever want to get actual predictions, right?

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious. Fred Brooks, The Mythical Man-Month

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious. Eric S. Raymond, The Cathedral and The Bazaar

UPDATE: For more, see Engineering is the bottleneck in (deep learning) research, by Denny Britz, and Software engineering vs machine learning concepts, by Paul Mineiro.

FastML

Machine learning made easy

Data in, predictions out

Comments