Computer science is about dealing with complexity. The main means for this is abstraction, meaning building things from smaller blocks. The things themselves then become blocks for building even bigger things.

The basic and most important unit of abstraction is a function. A function is a black box that takes some inputs and returns some outputs. The whole idea is that you don’t need to know how things work inside, all you care about is the interface. Few people know how exactly a car’s engine works. Knowing how to turn it on, off, and put some fuel in is enough.

The same goes for machine learning. From a user’s perspective, we would like to know how to format our data for training and then how to get predictions. It seems obvious, but unfortunately, practice shows otherwise. Time and time again we encounter implementations where data is just background for an algorithm.

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious. – Fred Brooks, The Mythical Man-Month

We dedicated two articles to this very matter: Loading data in Torch (is a mess) and How to get predictions from Pylearn2. Now for some brutality: it’s not a big surprise that Pylearn2 has been officially dead for some time and Torch only lives because it’s LeCun’s students’ brainchild.

As a more recent example, let’s look at Phased LSTM. The purpose of the model is to deal with asynchronous time series, where step size, or period between events, might differ. There are at least four implementations on GitHub, including the official one.

Naturally, since the point is to process irregularly sampled data, the first question would be how to represent such data. As an exercise, go figure this out.

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious. – Eric S. Raymond, The Cathedral and The Bazaar

Two of the implementations [1] [2] don’t bother with async inputs at all; they just use MNIST as an example of dealing with long time series, which is the secondary usage scenario.

The remaining two generate toy data on the fly, as is often the case with code accompanying a paper. In effect, there is no sample to look at: one needs to dig into the code, find the generator and run it to look at some data.

Bad programmers worry about the code. Good programmers worry about data structures and their relationships. - Linus Torvalds (the creator of Linux)

Similarly, getting predictions from a model is often an afterthought. Some authors are content to compute a bunch of metrics and leave it at that. Why would anyone ever want to get actual predictions, right?

Tell me, Mr Anderson: what good is a program if you’re unable to run it on your input?

The popularity of chatbots is coming from a few sources, apparently. One, they exemplify the AI dream. Two, making a conversational bot is a fun technical challenge. And three: for many, maybe most, businesses, labour constitutes the biggest cost. Therefore corporations salivate at the prospect of exchanging humans, if only in the internet chat channel, for computers.

Image credit: @0x7000

(Un)fortunately, we’re still very, very far - like 25 years - from strong AI, which in practical terms means that chatbots are crap. You won’t have a meaningful conversation with a computer. All it can do in a customer support role is provide some information and maybe sell you something. A chatbot is like an inflatable doll: cheap and available, but not much else.

Source: @sdw

Now let’s step back from this bland perspective and look at two very successful interfaces that involve typing: a Unix shell and google.com. Let us notice that in a way, they are an opposite to chatbots: they don’t pretend to have any intelligence, and while they use something based on natural language (because what else is there?), they are maximally succinct and to the point in terms of human input. For example, you don’t say,

```
$ List all files in this directory
```

You say

```
$ ls
```

Similarly, you are unlikely to have the following dialogue, at least when typing:

```
$ Hey Google, what's the temperature outside?
And where are you located, sir?
$ I'm in Singapore
It's 79 degrees Fahrenheit.
$ In Celsius, dang nabbit!
Sorry sir. It's 26 degrees Celsius.
```

Instead, one types “temperature Singapore” and when presented with a weather dashboard, clicks C. By the way, neither Singapore nor our location uses the Fahrenheit scale (only the USA and a few banana republics nearby do), so why does Google show us F?

From this little thought experiment we would conclude that the ideal chatbot is, in essence, artificial intelligence using natural language in written form as a communication channel. The critical AI component is just out of reach:

Human level AI is always just 25 years away. Source: PDF

That leaves only wordiness, and few like to type more than needed (except maybe Java programmers and the guy who designed infix operators in R). Speak, yes, but not type. That makes chatbots similar to communism: promising in theory, dismal in practice. If you know any examples to the contrary, we’d be delighted to get to know them.

Here’s Mat Kelcey proving us wrong:

Man, these Australians…


There’s an interesting story about how Hadley invented all those things. It goes like this. An angel - some say it was a daemon, but don’t believe them, it was an angel - visited our hero in a dream and said: ”**I will give you some ideas that will make you rich and famous - well, rich intellectually and famous in the R community - but there’s a catch. For reasons I won’t disclose, for a mortal like you wouldn’t really understand, you must make the piping operator as bad as you possibly can, but without rendering it outright ridiculous. No more than three chars, and remember - as ugly and hard to type as you can.**”

Hadley agreed and woke up. Being the smart guy that he is, in the morning he constructed an appropriate bundle of characters. Reportedly, the thought process unfolded as follows:

Okay, three characters. Let’s invert that old `<-`, and elongate it: `-->`. Meh, waaay too pretty and you type it all with one hand.

I know… `~~>` is good. `~>~` even better. One needs to press tilde key twice, then delete one, go to `>`, repeat, all with Shift pressed. Sweet. But dang, too pretty. I need ugly.

Think, man, think! Let’s see. `#>#`. Yeah. Nah, that looks half-reasonable. Wait… wait… yes… `%>%`. That’s it!

Image credit: Dexter’s Laboratory

And the rest is history:

```
carriers_db2 %>% summarise(delay = mean(arr_delay)) %>% collect()
```

Image credit: Natalie Cooper

But seriously, the man in question says that’s all because infix operators in R must have the form `%something%`. By the way, Hadley has at least two books online: Advanced R and R for Data Science.

We know of three modules for piping in Pandas: pandas-ply, dplython and dfply. All three use reasonable piping operators.

**pandas-ply**, from Coursera, is the simplest of them and closest to the Pandas spirit. It uses a normal dot for chaining and just adds a few methods to the DataFrame. Here’s their motivating example, adapted from the *dplyr* intro:

```
grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]
# instead:
(flights
    .groupby(['year', 'month', 'day'])
    .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
    .ply_where(X.arr > 30, X.dep > 30))
```

Less typing and no need for intermediate artifacts. Notice how you refer to the transformed dataframe inside the pipeline by X.

**dplython** is closer to *dplyr*. The module provides verbs (functions) similar to the R counterpart, but the pipeline operator is a handsome *>>*, and there’s this nice *diamonds* dataset:

```
(diamonds >>
    sample_n(10) >>
    arrange(X.carat) >>
    select(X.carat, X.cut, X.depth, X.price))

(diamonds >>
    mutate(carat_bin=X.carat.round()) >>
    group_by(X.cut, X.carat_bin) >>
    summarize(avg_price=X.price.mean()))
```

What’s with the outer parens? Is this Lisp or something?

Let us mention that in R, all functions are pipable. In Python, you need to make them pipable. dplython has a special decorator for it, *@DelayFunction*.
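How might a function be made pipable? Here is a minimal sketch that assumes nothing about dplython’s internals: Python lets the right-hand operand of `>>` handle the operation through `__rrshift__`, so a decorator can return a delayed call that waits for its data. The names `Pipeable`, `pipeable`, `head` and `sort_by` are hypothetical, and plain lists of dicts stand in for data frames.

```python
from functools import wraps


class Pipeable:
    """A delayed call: `data >> Pipeable(f, args)` evaluates `f(data, args)`."""

    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, data):
        # Python falls back to this when the left operand does not handle `>>`
        return self.func(data, *self.args, **self.kwargs)


def pipeable(func):
    """Decorator (hypothetical name): calling the function only records its arguments."""
    @wraps(func)
    def delayed(*args, **kwargs):
        return Pipeable(func, *args, **kwargs)
    return delayed


@pipeable
def sort_by(rows, key):
    return sorted(rows, key=lambda row: row[key])


@pipeable
def head(rows, n):
    return rows[:n]


rows = [{'carat': 0.9}, {'carat': 0.3}, {'carat': 0.5}]
result = rows >> sort_by('carat') >> head(2)
print(result)
```

The trick relies on the left operand not defining `>>` itself, which holds for lists; a real library has to deal with types that do.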

Finally, there is **dfply**, inspired by dplython, but with even more functions. It appears less mature than the previous two - *pip install* won’t work here. The example shows some means of deleting columns from a frame:

```
diamonds >> drop_endswith('e','y','z') >> head(2)
```

What these modules provide is mostly syntactic sugar, and whether to use them is a matter of personal taste. For example, while the pandas-ply flights example above is convincing, is one of these lines better than the others?

```
diamonds.ply_where(X.carat > 4).ply_select('carat', 'cut', 'depth', 'price')
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)
diamonds[diamonds.carat > 4][['carat', 'cut', 'depth', 'price']]
```

We’d like automatic X in pandas:

```
diamonds[X.carat > 4][X.carat, X.cut, X.depth, X.price]
```
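Plain pandas gets part of the way there without extra modules: `.loc` accepts a callable that receives the frame being indexed, which plays a role similar to X. A minimal sketch on a made-up four-row stand-in for the diamonds data (the values are invented for illustration):

```python
import pandas as pd

# Tiny made-up stand-in for the diamonds dataset
diamonds = pd.DataFrame({
    'carat': [0.3, 4.5, 1.2, 5.0],
    'cut': ['Ideal', 'Fair', 'Good', 'Fair'],
    'depth': [61.5, 65.0, 62.1, 66.2],
    'price': [400, 18000, 5000, 18500],
})

# The lambda receives the frame being indexed, so the frame name is not repeated
big = diamonds.loc[lambda d: d.carat > 4, ['carat', 'cut', 'depth', 'price']]
print(big)
```

It is wordier than X, but it works in the middle of a method chain, where the intermediate frame has no name.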

Apparently it was Stefan Milton Bache who invented the piping operator in the package magrittr. He’s Danish; they have the most complicated and difficult language in Europe, except for Hungarian. Danes don’t mind such trivial inconveniences as `%>%`. By the way, there’s more: `%T>%`, `%$%`, `%<>%`.

Neural networks are conceptually simple, and that’s their beauty. A bunch of homogeneous, uniform units, arranged in layers, weighted connections between them, and that’s all. At least in theory. Practice turned out to be a bit different. Instead of feature engineering, we now have architecture engineering, as described by Stephen Merrity:

The romanticized description of deep learning usually promises that the days of hand crafted feature engineering are gone - that the models are advanced enough to work this out themselves. Like most advertising, this is simultaneously true and misleading.

Whilst deep learning has simplified feature engineering in many cases, it certainly hasn’t removed it. As feature engineering has decreased, the architectures of the machine learning models themselves have become increasingly more complex. Most of the time, these model architectures are as specific to a given task as feature engineering used to be.

To clarify, this is still an important step. Architecture engineering is more general than feature engineering and provides many new opportunities. Having said that, however, we shouldn’t be oblivious to the fact that where we are is still far from where we intended to be.

Not quite as bad as the doings of architecture astronauts, but not too good either.

An example of architecture specific to a given task

How to explain those architectures? Naturally, with a diagram. A diagram will make it all crystal clear.

Let’s first inspect the two most popular types of networks these days, CNN and LSTM. You’ve already seen a convnet diagram, so turning to the iconic LSTM:

It’s easy, just take a closer look:

As they say, in mathematics you don’t understand things, you just get used to them.

Fortunately, there are good explanations, for example Understanding LSTM Networks and Written Memories: Understanding, Deriving and Extending the LSTM.

LSTM still too complex? Let’s try a simplified version, GRU (Gated Recurrent Unit). Trivial, really.

Especially this one, called *minimal GRU*.

Various modifications of LSTM are now common. Here’s one, called deep bidirectional LSTM:

DB-LSTM, PDF

The rest are pretty self-explanatory, too. Let’s start with a combination of CNN and LSTM, since you have both under your belt now:

Convolutional Residual Memory Network, 1606.05262

Dynamic NTM, 1607.00036

Evolvable Neural Turing Machines, PDF

Unsupervised Domain Adaptation By Backpropagation, 1409.7495

Deeply Recursive CNN For Image Super-Resolution, 1511.04491

Recurrent Model Of Visual Attention, 1406.6247

This diagram of multilayer perceptron with synthetic gradients scores high on clarity:

MLP with synthetic gradients, 1608.05343

Every day brings more. Here’s a fresh one, again from Google:

Google’s Neural Machine Translation System, 1609.08144

Drawings from the Neural Network ZOO are pleasantly simple, but, unfortunately, serve mostly as eye candy. For example:

LSM, ESN and ELM

These look like not-fully-connected perceptrons, but are supposed to represent a *Liquid State Machine*, an *Echo State Network*, and an *Extreme Learning Machine*.

How does LSM differ from ESN? That’s easy, it has green neurons with triangles. But how does ESN differ from ELM? Both have blue neurons.

Seriously, while similar, ESN is a recurrent network and ELM is not. And this kind of thing should probably be visible in an architecture diagram.

You haven’t seen anything till you’ve seen the Neural Compiler.

A Neural Compiler https://t.co/HbAOvoiBaS pic.twitter.com/cmPFFT832m

— Misha Denil (@notmisha) November 2, 2016

swibe: In traditional convolution layers, the convolution is tied up with cross-channel pooling: for each output channel, a convolution is applied to each input channel and the results are summed together.

This leads to the unfortunate situation where the network may often be repeatedly applying similar filters to each input channel. Storing these filters wastes memory, and applying them repeatedly wastes computation.

It’s possible to instead split the computation into a convolution stage, where multiple filters are applied to each input channel, and a cross-channel pooling stage where the output channels each use whichever intermediate results are of use to them. It allows the filters to be shared across multiple output channels. This reduces their number substantially, allowing for more efficient networks, or larger networks with a similar computational cost.
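To get a feel for the savings, here is a rough parameter count under assumed sizes: 3x3 kernels, 64 input channels, 128 output channels, biases ignored. `m` is the number of filters applied to each input channel in the convolution stage. The function names and the sizes are ours, chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k filter for every (input channel, output channel) pair
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out, m=1):
    # convolution stage: m filters of size k x k per input channel
    conv_stage = k * k * c_in * m
    # cross-channel stage: each output channel weights all c_in * m intermediate maps
    pooling_stage = c_in * m * c_out
    return conv_stage + pooling_stage


standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128, m=1)
print(standard, separable)
```

With these numbers the split version needs roughly an eighth of the weights; a larger `m` narrows the gap.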

benanne: This is not a particularly novel idea, so I get the feeling that some references are missing. It’s been available in TensorFlow as `tf.nn.separable_conv2d()` for a while, and in this presentation Vincent Vanhoucke [slides, video] also discusses it (slide 26 and onwards).

The results seem to be pretty solid though, and the interaction with residual connections probably makes it more practical. It’s a nice way to further increase depth and nonlinearity while keeping the computational cost and risk of overfitting at reasonable levels.

Now, you can use Cubert to make these beauties. However, if you’re more of a do-it-yourself type, here’s a HOWTO.

Let’s say you’ve performed dimensionality reduction with a method of your choosing and have some data points looking like this:

```
cid,x,y,z
1.0,0.131364496515,-0.590685372085,-1.00062387318
-1.0,-1.90206919581,-0.0518527188196,-1.01665336703
1.0,2.29749236265,-0.982830132008,0.0511009011955
```

First goes the class label and then the three dimensions. The software we use, data-projector, needs a JSON file:

```
{"points": [
{"y": "-79.0866574", "x": "-3.15971493", "z": "-98.5084333", "cid": "1.0"},
{"y": "-50.3503514", "x": "-100.0", "z": "-100.0", "cid": "0.0"},
{"y": "-100.0", "x": "100.0", "z": "-0.643983041", "cid": "1.0"}
]}
```

The dimensions in the cube go from -100 to 100, so we rescale the data accordingly:

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

d = pd.read_csv( input_file )
assert set( d.columns ) == set([ 'cid', 'x', 'y', 'z' ])
scaler = MinMaxScaler( feature_range=( -100, 100 ))
d[[ 'x', 'y', 'z' ]] = scaler.fit_transform( d[[ 'x', 'y', 'z' ]])
```

If our labels are in order (starting from 0), we’re ready to save to JSON:

```
d_json = { 'points': json.loads( d.astype( str ).to_json( None, orient= 'records' )) }
# open in text mode: in Python 3, json.dump writes str, not bytes
with open( output_file, 'w' ) as f:
    json.dump( d_json, f )
```

Why the acrobatics in the first line? We could save directly with:

```
d.astype( str ).to_json( output_file, orient = 'records' )
```

The reason is that we need to wrap the data in a dictionary with one key called ‘points’. Therefore, we:

- convert the data frame to a JSON string
- load it into a JSON object
- dump the object to a file
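A possible shortcut, sketched here as an alternative rather than as what the linked code does: `to_dict( orient = 'records' )` returns the list of row dictionaries directly, so the round trip through a JSON string is not needed. The two data points below are made up.

```python
import json
import pandas as pd

# Made-up points, already scaled to the cube's range
d = pd.DataFrame({
    'cid': [1.0, 0.0],
    'x': [-3.16, -100.0],
    'y': [-79.09, -50.35],
    'z': [-98.51, -100.0],
})

# A list of row dicts, wrapped under the single 'points' key
d_json = {'points': d.astype(str).to_dict(orient='records')}
text = json.dumps(d_json)
print(text)
```

`json.dumps` is used here to keep the sketch free of file handling; `json.dump` with an open file works the same way.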

The complete code is available at GitHub.

Now move `data.json` to the `data-projector` directory and open `index.html` with your browser. That is, if your browser happens to be Firefox.

If you’re using Chrome, you’ll need to access `index.html` through HTTP, because apparently Chrome policy doesn’t allow loading data from external files when opening a file from a local disk.

Yasser Souri gives one solution:

- open a console and go to the `data-projector` directory
- type `python -m SimpleHTTPServer 80` (assuming Python 2.x; in Python 3 the module is called `http.server`)
- open *http://localhost* in your browser

The problem with training examples being different from test examples is that validation won’t be any good for comparing models. That’s because validation examples originate in the training set.

We can see this effect when using Numerai data, which comes from financial time series. We first tried logistic regression and got the following validation scores:

```
LR
AUC: 52.67%, accuracy: 52.74%
MinMaxScaler + LR
AUC: 53.52%, accuracy: 52.48%
```

What about a more expressive model, like logistic regression with polynomial features (that is, feature interactions)? They’re easy to create with *scikit-learn*:

```
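from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import LogisticRegression
poly_scaled_lr = make_pipeline( PolynomialFeatures(), MinMaxScaler(), LogisticRegression())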
```

This pipeline looked much better in validation than plain logistic regression, and also better than the *MinMaxScaler + LR* combo:

```
PolynomialFeatures + MinMaxScaler + LR
AUC: 53.62%, accuracy: 53.04%
```

So that’s a no-brainer, right? Here are the actual leaderboard scores (from the earlier round of the tournament, using AUC):

```
# AUC 0.51706 / LR
# AUC 0.52781 / MinMaxScaler + LR
# AUC 0.51784 / PolynomialFeatures + MinMaxScaler + LR
```

After all, poly features do about as well as plain LR. Scaler + LR seems to be the best option.

We couldn’t tell that from validation, so it appears that we can’t trust it for selecting models and their parameters.

We’d like to have a validation set representative of the Numerai test set. To that end, we’ll take care to select examples for the validation set which are the most similar to the test set.

Specifically, we’ll run the distinguishing classifier in cross-validation mode, to get predictions for all training examples. Then we’ll see which training examples are misclassified as test and use them for validation.

To be more precise, we’ll choose a number of misclassified examples that the model was most certain about. It means that they look like test examples but in reality are training examples.
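The whole procedure might look roughly like this. It is a sketch on synthetic data, not the code used for Numerai: the feature matrices, the sizes and names such as `is_test` and `val_idx` are made up, and here label 1 means “test”.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
# Synthetic stand-ins: the "test" features are shifted relative to "train"
x_train = rng.normal(0.0, 1.0, size=(300, 5))
x_test = rng.normal(0.5, 1.0, size=(300, 5))

x = np.vstack((x_train, x_test))
is_test = np.concatenate((np.zeros(len(x_train)), np.ones(len(x_test))))

# Out-of-fold predictions: every example is scored by a model that never saw it
clf = RandomForestClassifier(n_estimators=50, random_state=0)
p_test = cross_val_predict(clf, x, is_test, cv=5, method='predict_proba')[:, 1]

# Keep the scores of the training rows only and sort them in ascending order
p_train_rows = p_test[:len(x_train)]
order = p_train_rows.argsort()

val_size = 50
val_idx = order[-val_size:]     # training rows that look most like test
train_idx = order[:-val_size]   # the rest stays for training
```

Sorting by probability instead of thresholding at 0.5 means the validation set can be as large as needed, even when few training rows are outright misclassified.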

Numerai data after PCA. Training set in red, test in turquoise. Quite regular, shaped like a sphere…

Or maybe like a cube? Anyway, sets look difficult to separate.

UPDATE: Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.

First, let’s try training a classifier to tell train from test, just like we did with the Santander data. Mechanics are the same, but instead of 0.5, we get 0.87 AUC, meaning that the model is able to classify the examples pretty well (at least in terms of AUC, which measures ordering/ranking).

By the way, there are only about 50 training examples that random forest misclassifies as test examples (assigning probability greater than 0.5). We work with what we have and mostly care about the order, though.
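Why can AUC be high when almost nothing crosses the 0.5 threshold? AUC only looks at ordering: it is the chance that a randomly chosen positive gets a higher score than a randomly chosen negative. A small pure-Python sketch with made-up scores (the function `auc` is ours, written for clarity rather than speed):

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
squashed = [s / 10 for s in scores]   # same order, everything far below 0.5

print(auc(labels, scores), auc(labels, squashed))
```

Both calls give the same value, even though with the squashed scores no example would be classified as positive at the 0.5 threshold.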

Cross-validation provides predictions for all the training points. Now we’d like to sort the training points by their estimated probability of being test examples.

```
i = predictions.argsort()
train['p'] = predictions
train_sorted = train.iloc[i]
```

We did the ascending sort, so for validation we take a desired number of examples from the end:

```
val_size = 5000
# train_sorted is ordered by ascending p, so the most test-like rows come last
train = train_sorted.iloc[:-val_size]
val = train_sorted.iloc[-val_size:]
```

The current evaluation metric for the competition is log loss. We’re not using a scaler with LR anymore because the data is already scaled. We only scale after creating poly features.

```
LR
AUC: 52.54%, accuracy: 51.96%, log loss: 69.22%
Pipeline(steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])
AUC: 52.57%, accuracy: 51.76%, log loss: 69.58%
```

Let us note that differences between models in validation are pretty slim. Even so, the order is correct - we would choose the right model from the validation scores. Here’s the summary of results achieved for the two models:

Validation:

```
# 0.6922 / LR
# 0.6958 / PolynomialFeatures + MinMaxScaler + LR
```

Public leaderboard:

```
# 0.6910 / LR
# 0.6923 / PolynomialFeatures + MinMaxScaler + LR
```

And the private leaderboard at the end of the May round:

```
# 0.6916 / LR
# 0.6954 / PolynomialFeatures + MinMaxScaler + LR
```

As you can see, our improved validation scores translate closely into the private leaderboard scores.

- Train a classifier to identify whether data comes from the train or test set.
- Sort the training data by its probability of being in the test set.
- Select the training data most similar to the test data as your validation set.

*(By Jim Fleming)*

Still, it’s a drag to model upper and lower case separately. It adds to dimensionality, and perhaps more importantly, a network gets no clue that ‘a’ and ‘A’ actually represent pretty much the same thing.

The simplest solution is to discard uppercase and just use lowercase. We propose a more elegant way to deal with the two problems mentioned above: **inserting special markers before each uppercase letter**.

```
Hello World -> ^hello ^world
```

The resulting text is still quite readable. Of course you need to make sure there are no carets in your input to start with, but this is a minor matter: one could use any character as a marker, or invent one. Remember that a char is just a sparse vector: we can make it longer by one element, and that abstract element can be our marker.

Gents witnessing the emergence of R33, the very first char-RNN, back in the day

Here’s how to convert mixed-case text:

```
import re
s = 'Hello World'
re.sub( '([A-Z])', '^\\1', s ).lower()
```

What we do is insert a caret before each uppercase letter and then turn the whole string to lowercase (`\1` is a *backreference* to a subgroup marked by parens in the first pattern; we need to quote the backslash, hence `\\1`). An alternative is to perform both operations inside `sub()` using a function to modify the match and return a replacement:

```
re.sub( '([A-Z])', lambda match: "^" + match.group( 1 ).lower(), s )
```

Should we need to convert stuff back, we’d use a similar construct:

```
s = '^hello ^world'
re.sub( r'\^(.)', lambda match: match.group( 1 ).upper(), s )
```

A caret means “start of the line” in a regular expression, so we need to quote it with a backslash.
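Both directions can be wrapped into a pair of functions and checked with a round trip. A minimal sketch: the names `encode_case` and `decode_case` are ours, and only ASCII uppercase letters are handled, as in the patterns above.

```python
import re


def encode_case(text):
    """Mark each uppercase ASCII letter with a caret, then lowercase everything."""
    if '^' in text:
        raise ValueError('input already contains the marker character')
    return re.sub(r'([A-Z])', r'^\1', text).lower()


def decode_case(text):
    """Drop each caret and uppercase the character that follows it."""
    return re.sub(r'\^(.)', lambda match: match.group(1).upper(), text)


original = 'Hello World. It Works.'
encoded = encode_case(original)
decoded = decode_case(encoded)
print(encoded)
print(decoded)
```

Raw strings (`r'...'`) save us from doubling the backslashes.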

Does it work? It does. The network is especially quick to learn the `. ^` combo, representing the end of a sentence and an uppercase letter at the beginning of the next one.

The trick described above is meant for text. People have used char-RNNs for modelling other stuff. It is conceivable to use a similar gimmick for source code, or music, for example to insert bar markers - that might help a network to learn the rhythm.

In part one, we inspect the ideal case: training and testing examples coming from the same distribution, so that the validation error should give a good estimate of the test error and the classifier should generalize well to unseen test examples.

In such a situation, if we attempted to train a classifier to distinguish training examples from test examples, it would perform no better than random. This would correspond to ROC AUC of 0.5.

Does it happen in reality? It does, for example in the Santander Customer Satisfaction competition at Kaggle.

We start by setting the labels according to the task. It’s as easy as:

```
import numpy as np
import pandas as pd

train = pd.read_csv( 'data/train.csv' )
test = pd.read_csv( 'data/test.csv' )
train['TARGET'] = 1
test['TARGET'] = 0
```

Then we concatenate both frames and shuffle the examples:

```
data = pd.concat(( train, test ))
data = data.iloc[ np.random.permutation(len( data )) ]
data.reset_index( drop = True, inplace = True )
x = data.drop( [ 'TARGET', 'ID' ], axis = 1 )
y = data.TARGET
```

Finally we create a new train/test split:

```
train_examples = 100000
x_train = x[:train_examples]
x_test = x[train_examples:]
y_train = y[:train_examples]
y_test = y[train_examples:]
```

Come to think of it, there’s a shorter way (no need to shuffle examples beforehand, too):

```
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y, train_size = train_examples )
```

Now we’re ready to train and evaluate. Here are the scores:

```
# logistic regression / AUC: 49.82%
# random forest, 10 trees / AUC: 50.05%
# random forest, 100 trees / AUC: 49.95%
```

Train and test are like two peas in a pod, like Tweedledum and Tweedledee - indistinguishable to our models.

Below is a 3D interactive visualization of the combined train and test sets, in red and turquoise. They very much overlap. Click the image to view the interactive version (might take a while to load, the data file is ~8MB).

**UPDATE**: The hosting provider which shall remain unnamed has taken down the account with visualizations. We plan to re-create them on Cubert. In the meantime, you can do so yourself.

Santander training set after PCA

UPDATE: It’s a-live! Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.

Let’s see if validation scores translate into leaderboard scores, then. We train and validate logistic regression and a random forest. LR gets **58.30**% AUC, RF **75.32**% (subject to randomness).

On the private leaderboard LR scores **61.47**% and RF **74.37**%. These numbers correspond pretty well to the validation results.

The code is available at GitHub.

In part two, due in two weeks, we’ll see what we can do when train and test differ.


Here’s a hint. Think about the following:

- Have you ever seen a photo of Zygmunt?
- Have you ever met Zygmunt at a conference or heard him speak there?
- Have you ever read any papers by Zygmunt?

That’s right. There’s no Zygmunt the Polish economist ever willing to relocate to San Francisco.

And the “we” that we always use in the posts is not majestic plural. **We** are three Chinese PhD students: Ah, Hai, and Wang*.

To keep the style consistent, Ah does most of the writing and all of the editing. He likes to stay on top of state of the art, runs @fastml_extra and can be a bit shy sometimes. Ah has a (no-longer) secret crush on Anima Anandkumar.

Hai doesn’t care about state of the art or any fancy methods. He just wants to get stuff done. He also doesn’t mind things like data wrangling, feature engineering, and plotting. Hai likes to help people, writes most of the code and is the principal author of phraug.

Wang has a confrontational style and strong opinions about everything. He hates big data, Hadoop, Spark, Weka, Java, Facebook, Google, TensorFlow, AI, distributed systems, you name it. One entity Wang tolerates is Kaggle.

Unfortunately, he has the strongest command of spoken English and almost no accent, so if you are one of the few who have spoken to Zygmunt over Skype, sorry - it was Wang.

In case you exchanged any written messages with Zygmunt, it might have been any one of us typing. We usually route our traffic through Poland to keep up appearances.

Why the deception? China, aspiring to the role of global superpower, tends to evoke strong feelings. At the same time, many people in the west unfortunately disregard Chinese scientists and their work. Therefore we adopted a neutral guise so as not to distract readers from the content.

To preserve the cover, we tried to abstain from mentioning Chinese creations. We think we mostly succeeded - we have only written about Extreme Learning Machines, Liblinear, and stuff by DMLC guys: XGBoost and MXNet.

Keeping our mouths shut that way was hard, because Chinese have a lot to offer, both in traditional machine learning and in deep learning - even when counting only those in continental China. There’s been some very advanced research going on, for example on topic models and dimensionality reduction [1] [2] [3] [4] [5]. Ignore it at your own peril.

Let your plans be dark and impenetrable as night, and when you move, fall like a thunderbolt.

–Sun Tzu, The Art of War

*Authors contribute unequally.


*While we have some grasp on the matter, we’re not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.*

In essence, Bayesian means probabilistic. The specific term exists because there are two approaches to probability. Bayesians think of it as a measure of belief, so that probability is subjective and refers to the future.

Frequentists have a different view: they use probability to refer to past events - in this way it’s objective and doesn’t depend on one’s beliefs. The name comes from the method - for example: we tossed a coin 100 times, it came up heads 53 times, so the frequency/probability of heads is 0.53.

For a thorough investigation of this topic and more, refer to Jake VanderPlas’ Frequentism and Bayesianism series of articles.

As Bayesians, we start with a belief, called a prior. Then we obtain some data and use it to update our belief. The outcome is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats.
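For the coin from the frequentist example, this cycle is easy to write down, because a Beta distribution is a conjugate prior for coin tosses: updating boils down to adding counts. A minimal sketch; the uniform Beta(1, 1) prior and the split into two batches are our assumptions.

```python
def update(prior, heads, tails):
    """Beta(a, b) prior plus observed tosses gives a Beta(a + heads, b + tails) posterior."""
    a, b = prior
    return (a + heads, b + tails)


def mean(beta):
    a, b = beta
    return a / (a + b)


prior = (1, 1)                        # uniform belief about the probability of heads
posterior = update(prior, 53, 47)     # all 100 tosses at once

# The same tosses in two batches: the first posterior serves as the next prior
step1 = update(prior, 30, 20)
step2 = update(step1, 23, 27)

print(posterior, step2, round(mean(posterior), 3))
```

Processing the data in one go or in batches leads to the same posterior, which is what makes the cycle work.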

This process employs the **Bayes rule**:

```
P( A | B ) = P( B | A ) * P( A ) / P( B )
```

`P( A | B )`, read as “probability of A given B”, indicates a conditional probability: how likely is A if B happens.
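A quick worked example with made-up numbers: let A be a rare condition and B a positive result of an imperfect test. All three input probabilities below are assumptions chosen for illustration.

```python
p_a = 0.01                # P( A ): the prior, A is rare
p_b_given_a = 0.9         # P( B | A ): the test detects A most of the time
p_b_given_not_a = 0.05    # P( B | not A ): false positive rate

# P( B ) from the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# The Bayes rule
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Even after a positive result, A remains fairly unlikely, because the prior was so low.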

In Bayesian machine learning we use the Bayes rule to infer model parameters (theta) from data (D):

```
P( theta | D ) = P( D | theta ) * P( theta ) / P( D )
```

All components of this are probability distributions.

`P( D )` is something we generally cannot compute, but since it’s just a normalizing constant, it doesn’t matter that much. When comparing models, we’re mainly interested in expressions containing theta, because `P( D )` stays the same for each model.

`P( theta )` is a prior, or our belief of what the model parameters might be. Most often our opinion in this matter is rather vague and if we have enough data, we simply don’t care. Inference should converge to probable theta as long as it’s not zero in the prior. One specifies a prior in terms of a parametrized distribution - see Where priors come from.

`P( D | theta )` is called likelihood of data given model parameters. The formula for likelihood is model-specific. People often use likelihood for evaluation of models: a model that gives higher likelihood to real data is better.

Finally, `P( theta | D )`, a posterior, is what we’re after. It’s a probability distribution over model parameters obtained from prior beliefs and data.

When one uses likelihood to get point estimates of model parameters, it’s called maximum-likelihood estimation, or MLE. If one also takes the prior into account, then it’s maximum a posteriori estimation (MAP). MLE and MAP are the same if the prior is uniform.
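For a coin with a Beta prior, both estimates have closed forms, so the claim is easy to check. A sketch using 53 heads in 100 tosses; the Beta(5, 5) prior is an arbitrary example of a non-uniform belief, and the function names are ours.

```python
def mle(heads, n):
    """Maximum-likelihood estimate of the probability of heads."""
    return heads / n


def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior (both parameters must exceed 1)."""
    return (heads + a - 1) / (n + a + b - 2)


heads, n = 53, 100
print(mle(heads, n))                   # likelihood only
print(map_estimate(heads, n, 1, 1))    # uniform prior Beta(1, 1)
print(map_estimate(heads, n, 5, 5))    # prior concentrated around 0.5
```

The uniform prior reproduces MLE exactly, while the informative one pulls the estimate towards 0.5.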

Note that choosing a model can be seen as separate from choosing model (hyper)parameters. In practice, though, they are usually performed together, by validation.

Inference refers to how you learn parameters of your model. A model is separate from how you train it, especially in the Bayesian world.

Consider deep learning: you can train a network using Adam, RMSProp or a number of other optimizers. However, they tend to be rather similar to each other, all being variants of Stochastic Gradient Descent. In contrast, Bayesian methods of inference differ from each other more profoundly.

The two most important methods are Monte Carlo sampling and variational inference. Sampling is a gold standard, but slow. The excerpt from The Master Algorithm has more on MCMC.

Variational inference is a method designed explicitly to trade some accuracy for speed. Its drawback is that it’s model-specific, but there’s light at the end of the tunnel - see the section on software below and Variational Inference: A Review for Statisticians.

In the spectrum of Bayesian methods, there are two main flavours. Let’s call the first *statistical modelling* and the second *probabilistic machine learning*. The latter contains the so-called nonparametric approaches.

Modelling happens when data is scarce and precious and hard to obtain, for example in social sciences and other settings where it is difficult to conduct a large-scale controlled experiment. Imagine a statistician meticulously constructing and tweaking a model using what little data he has. In this setting you spare no effort to make the best use of available input.

Also, with small data it is important to quantify uncertainty, and that’s precisely what the Bayesian approach is good at.

Bayesian methods - specifically MCMC - are usually computationally costly. This again goes hand-in-hand with small data.

To get a taste, consider examples for the Data Analysis Using Regression and Multilevel/Hierarchical Models book. That’s a whole book on linear models. They start with a bang: a linear model with no predictors, then go through a number of linear models with one predictor, two predictors, six predictors, up to eleven.

This labor-intensive mode goes against a current trend in machine learning to use data for a computer to learn automatically from it.

Let’s try replacing “Bayesian” with “probabilistic”. From this perspective, it doesn’t differ as much from other methods. As far as classification goes, most classifiers are able to output probabilistic predictions. Even SVMs, which are sort of an antithesis of Bayesian.

By the way, these probabilities are only statements of belief from a classifier. Whether they correspond to real probabilities is another matter completely and it’s called calibration.
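A simple way to check calibration is to bin the predictions and compare the mean predicted probability in each bin with the observed fraction of positives. The sketch below is our own illustration with made-up numbers; the function name and the equal-width binning are assumptions, not a reference implementation.

```python
def reliability_table(probs, labels, n_bins=5):
    """Group predictions into equal-width bins and summarize each bin.

    Returns a list of (mean predicted probability, fraction of positives,
    bin size) tuples, skipping empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes to the last bin
        bins[i].append((p, y))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        rows.append((mean_p, frac_pos, len(members)))
    return rows


# toy predictions: confident about the positives, but right only half the time
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 0]
table = reliability_table(probs, labels)
for mean_p, frac_pos, n in table:
    print("predicted {:.2f}, observed {:.2f}, n = {}".format(mean_p, frac_pos, n))
```

For a well-calibrated classifier the two numbers in each row are close. Here the top bin predicts 0.9 but only half of its examples are positive.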

Latent Dirichlet Allocation is a method that one throws data at and allows it to sort things out (as opposed to manual modelling). It’s similar to matrix factorization models, especially non-negative MF. You start with a matrix where rows are documents, columns are words and each element is a count of a given word in a given document. LDA “factorizes” this matrix of size *n x d* into two matrices, documents/topics (*n x k*) and topics/words (*k x d*).

The difference from factorization is that you can’t multiply those two matrices to get the original, but since the appropriate rows/columns sum to one, you can “generate” a document. To get the first word, one samples a topic, then a word from this topic (the second matrix). Repeat this for a number of words you want. Notice that this is a bag-of-words representation, not a proper sequence of words.
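The generative procedure described above can be sketched in a few lines. The vocabulary, the two topics and all the probabilities are made up for illustration, and `generate_document` is a hypothetical helper, not part of any LDA library.

```python
import random


def generate_document(topic_probs, topic_word_probs, vocab, n_words, rng):
    """Sample a bag of words: for each word pick a topic, then a word from it.

    topic_probs      - one row of the documents/topics matrix (sums to one)
    topic_word_probs - the topics/words matrix (each row sums to one)
    """
    topics = list(range(len(topic_probs)))
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_probs)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return words


vocab = ["goal", "match", "vote", "party"]
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],   # a "sports" topic
    [0.0, 0.0, 0.5, 0.5],   # a "politics" topic
]
rng = random.Random(0)

sports_doc = generate_document([1.0, 0.0], topic_word_probs, vocab, 10, rng)
mixed_doc = generate_document([0.5, 0.5], topic_word_probs, vocab, 10, rng)
print(sports_doc)
print(mixed_doc)
```

The order of the sampled words carries no meaning, which is the bag-of-words property mentioned above.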

The above is an example of a **generative** model, meaning that one can sample, or generate examples, from it. Compare with classifiers, which usually model `P( y | x )` to discriminate between classes based on *x*. A generative model is concerned with joint distribution of *y* and *x*, `P( y, x )`. It’s more difficult to estimate that distribution, but it allows sampling and of course one can get `P( y | x )` from `P( y, x )`.
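Getting `P( y | x )` from `P( y, x )` amounts to dividing the joint by the marginal `P( x )`. A minimal sketch with a made-up joint table over two classes and two feature values (all names and numbers are illustrative assumptions):

```python
# joint distribution P( y, x ), keyed by (y, x); the entries sum to one
joint = {
    ("spam", "offer"): 0.20,
    ("spam", "hello"): 0.10,
    ("ham", "offer"): 0.05,
    ("ham", "hello"): 0.65,
}


def conditional(joint, x):
    """Return P( y | x ) as a dict, computed as P( y, x ) / P( x )."""
    p_x = sum(p for (_, x_), p in joint.items() if x_ == x)   # marginal P( x )
    return {y: p / p_x for (y, x_), p in joint.items() if x_ == x}


p_y_given_offer = conditional(joint, "offer")
print(p_y_given_offer)
```

The reverse does not work: a classifier that only knows `P( y | x )` has no information about how *x* itself is distributed, so it cannot generate examples.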

While there’s no exact definition, the name *nonparametric* means that the number of parameters in a model can grow as more data become available. This is similar to Support Vector Machines, for example, where the algorithm chooses support vectors from the training points. Nonparametrics include Hierarchical Dirichlet Process version of LDA, where the number of topics chooses itself automatically, and Gaussian Processes.

Gaussian processes are somewhat similar to Support Vector Machines - both use kernels and have similar scalability (which has been vastly improved throughout the years by using approximations). A natural formulation for GP is regression, with classification as an afterthought. For SVM it’s the other way around.

Another difference is that GP are probabilistic from the ground up (providing error bars), while SVM are not. You can observe this in regression. Most “normal” methods only provide point estimates. Bayesian counterparts, like Gaussian processes, also output uncertainty estimates.

Credit: Yarin Gal’s Heteroscedastic dropout uncertainty
and What my deep model doesn’t know

Unfortunately, it’s not the end of the story. Even a sophisticated method like GP normally operates on an assumption of homoscedasticity, that is, uniform noise levels. In reality, noise might differ across input space (be heteroscedastic) - see the image below.

A relatively popular application of Gaussian Processes is hyperparameter optimization for machine learning algorithms. The data is small, both in dimensionality - usually only a few parameters to tweak, and in the number of examples. Each example represents one run of the target algorithm, which might take hours or days. Therefore we’d like to get to the good stuff with as few examples as possible.

Most of the research on GP seems to happen in Europe. The English have done some interesting work on making GP easier to use, culminating in the automated statistician, a project led by Zoubin Ghahramani.

Watch the first 10 minutes of this video for an accessible intro to Gaussian Processes.

The most conspicuous piece of Bayesian software these days is probably Stan. Stan is a probabilistic programming language, meaning that it allows you to specify and train whatever Bayesian models you want. It runs in Python, R and other languages. Stan has a modern sampler called NUTS:

Most of the computation [in Stan] is done using Hamiltonian Monte Carlo. HMC requires some tuning, so Matt Hoffman up and wrote a new algorithm, Nuts (the “No-U-Turn Sampler”) which optimizes HMC adaptively. In many settings, Nuts is actually more computationally efficient than the optimal static HMC!

One especially interesting thing about Stan is that it has automatic variational inference:

Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). The user only provides a Bayesian model and a dataset; nothing else.

This technique paves the way to applying small-style modelling to at least medium-sized data.

In Python, the most popular package is PyMC. It is not as advanced or polished (the developers seem to be playing catch-up with Stan), but still good. PyMC has NUTS and ADVI - here’s a notebook with a minibatch ADVI example. The software uses Theano as a backend, so it’s faster than pure Python.

Infer.NET is Microsoft’s library for probabilistic programming. It’s mainly available from languages like C# and F#, but apparently can also be called from .NET’s IronPython. Infer.NET uses expectation propagation by default.

Besides those, there’s a myriad of packages implementing various flavours of Bayesian computing, from other probabilistic programming languages to specialized LDA implementations. One interesting example is CrossCat:

CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.

and BayesDB/Bayeslite from the same people.

To solidify your understanding, you might go through Radford Neal’s tutorial on Bayesian Methods for Machine Learning. It corresponds 1:1 to the subject of this post.

We found Kruschke’s Doing Bayesian Data Analysis, known as the puppy book, most readable. The author goes to great lengths to explain all the ins and outs of modelling.

Statistical Rethinking appears to be of a similar kind, but newer. It has examples in R + Stan. The author, Richard McElreath, published a series of lectures on YouTube.

In terms of machine learning, both books only go as far as linear models. Likewise, Cam Davidson-Pilon’s Probabilistic Programming & Bayesian Methods for Hackers covers the *Bayesian* part, but not the *machine learning* part.

The same goes for Alex Etz’ series of articles on understanding Bayes.

For those mathematically inclined, Machine Learning: a Probabilistic Perspective by Kevin Murphy might be a good book to check out. You like hardcore? No problemo, Bishop’s Pattern Recognition and Machine Learning got you covered. One recent Reddit thread briefly discusses these two.

Bayesian Reasoning and Machine Learning by David Barber is also popular, and freely available online, as is Gaussian Processes for Machine Learning, the classic book on the matter.

As far as we know, there’s no MOOC on Bayesian machine learning, but *mathematicalmonk* explains machine learning from the Bayesian perspective.

Stan has an extensive manual, PyMC a tutorial and quite a few examples.

Many data science competitions suffer from the test set being markedly different from a training set (a violation of the “identically distributed” assumption). It is then difficult to make a representative validation set. We propose a method for selecting training examples most similar to test examples and using them as a validation set. The core of this idea is training a probabilistic classifier to distinguish train / test examples.

So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - we know it took us a while. This article is an introduction we wish we had back then.

For us, there are two major challenges facing deep learning: computational demands and cognitive demands. By cognitive demands we mean that stuff is getting complicated. We take a look at the situation and how people go about dealing with computational demands.

Conformal prediction is related to classifier calibration. The basic premise is that you get guaranteed max. error rate (false negatives, to be exact), and you set that rate as low or as high as you’re willing to tolerate. The catch is, you may get multiple classes assigned to an example: in binary classification, a point can be labelled **both** positive and negative.

The Genentech competition made available rather large data files containing complete medical history of a few million patients. The biggest three were roughly 50GB on disk and 500 million examples each. How to handle such files, specifically how to run GROUP BY operations? We considered two choices: a relational database or Pandas. We went with Pandas. It didn’t quite work even when using a machine with enough RAM, but we found a way.

Now that you know the options, please cast your vote for what you would like to read about next.

**UPDATE**: Voters clearly seem to prefer an article about Bayesian machine learning, so it’s coming. Posts on the other subjects may appear too, possibly in shorter-than-usual form.

In 2005, Caruana et al. made an empirical comparison of supervised learning algorithms [video]. They included random forests and boosted decision trees and concluded that

With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second.

Let’s note two things here. First, they mention **calibrated** boosted trees, meaning that for probabilistic classification, the trees needed calibration to be the best. Second, it’s unclear what boosting method the authors used.

In the follow-up study concerning supervised learning in high dimensions the results are similar:

Although there is substantial variability in performance across problems and metrics in our experiments, we can discern several interesting results. First, the results confirm the experiments in (Caruana & Niculescu-Mizil, 2006) where boosted decision trees perform exceptionally well when dimensionality is low. In this study boosted trees are the method of choice for up to about 4000 dimensions. Above that, random forests have the best overall performance.

Ten years later Fernandez-Delgado et al. revisited the topic with the paper titled Do we need hundreds of classifiers to solve real world classification problems? Notably, there were no results for gradient-boosted trees, so we asked the author about it. Here’s the answer, reprinted with permission:

That comment has been issued by other researcher (David Herrington), our response was that we tried GBM (gradient boosting machine) in R directly and via caret, but we achieved errors for problems with more than two[-class] data sets. However, in response to him, we developed further experiments with GBM (using only two-class data sets) achieving good results, even better than random forest but only for two-class data sets. This is the email with the results. I hope they can be useful for you. Best regards!

Dear Prof. Herrington:

I apologize for the delay in the answer to your last email. I have achieved results using gbm, but I was so delayed because I found errors with data sets more than two classes: gbm with caret only worked with two-class data sets, it gives an error with multi-class data sets, the same error as in http://stackoverflow.com/questions/15585501/usage-of-caret-with-gbm-method-for-multiclass-classification.

I tried to run gbm directly in R as tells the previous link, but I also found errors with multi-class data sets. I have been trying to find a program that runs, but I did not get it. I will keep trying, but by now I send to you the results with two classes, comparing both GBM and Random Forests (in caret, i.e., rf_t in the paper). The GBM worked without only for 51 data sets (most of them with two classes, although there are 55 data sets with two classes, so that GBM gave errors in 4 two-class data sets), and the average accuracies are:

rf = 82.30% (+/-15.3), gbm = 83.17% (+/-12.5)

so that GBM is better than rf_t. In the paper, the best classifier for two-class data sets was avNNet_t, with 83.0% accuracy, so that GBM is better on these 51 data sets. Attached I send to you the results of RF and GBM, and the plot with the two accuracies (ordered decreasingly) for the 51 data sets.

The detailed results are available on GitHub.

From the chart it would seem that RF and GBM are very much on par. Our feeling is that GBM offers a bigger edge. For example, in Kaggle competitions XGBoost replaced random forests as a method of choice (where applicable).

If we were to guess, the edge didn’t show in the paper because GBT need way more tuning than random forests. It’s quite time consuming to tune an algorithm to the max for each of the many datasets.

With a random forest, in contrast, the first parameter to select is the number of trees. Easy: the more, the better. That’s because the multitude of trees serves to reduce variance. Each tree fits, or overfits, a part of the training set, and in the end their errors cancel out, at least partially. Random forests do overfit, just compare the error on train and validation sets.
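The variance argument can be illustrated with a toy simulation. We assume, unrealistically, that each tree’s prediction is the true value plus independent Gaussian noise; real trees in a forest are correlated, so the reduction in practice is smaller. All names here are hypothetical.

```python
import random
import statistics


def ensemble_prediction(truth, n_trees, noise_sd, rng):
    """Average the predictions of n_trees idealized, independent 'trees'."""
    predictions = [truth + rng.gauss(0, noise_sd) for _ in range(n_trees)]
    return sum(predictions) / len(predictions)


rng = random.Random(0)
n_repeats = 2000

# spread of the prediction across repeated trainings: one tree vs. 100 trees
single = [ensemble_prediction(1.0, 1, 1.0, rng) for _ in range(n_repeats)]
forest = [ensemble_prediction(1.0, 100, 1.0, rng) for _ in range(n_repeats)]

single_sd = statistics.pstdev(single)
forest_sd = statistics.pstdev(forest)
print("one tree: {:.3f}, 100 trees: {:.3f}".format(single_sd, forest_sd))
```

Under the independence assumption the standard deviation shrinks roughly with the square root of the number of trees, which is why adding trees does no harm apart from computation time.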

Other parameters you may want to look at are those controlling how big a tree can grow. As mentioned above, averaging predictions from each tree counteracts overfitting, so usually one wants biggish trees.

One such parameter is *min. samples per leaf*. In *scikit-learn*’s RF, its value is one by default. Sometimes you can try increasing this value a little bit to get smaller trees and less overfitting. This CoverType benchmark overdoes it, going from 1 to 13 at once. Try 2 or 3 first.

Finally, there’s *max. features to consider*. Once upon a time, we tried tuning that param, to no avail. We suspect that it may have a better effect when dealing with sparse data - it would make sense to try increasing it then.

That’s about it for random forests. With gradient-boosted trees there are so many parameters that it’s a subject for a separate article.


For one thing, the dataset is very clean and tidy. As we mentioned in the article on the Rossmann competition, most Kaggle offerings have their quirks. Often we were getting an impression that the organizers were making the competition unnecessarily convoluted - apparently against their own interests. It’s rather hard to find a contest where you could just apply whatever methods you fancy, without much data cleaning and feature engineering. In this tournament, you can do exactly that.

The task is binary classification. The dataset is low dimensional (14 continuous variables, one categorical, with cardinality of 23) and has a lot of examples, but not too many - 55k. All you need to do is create a validation set (an indicator column is supplied for that), take care of the categorical variable, and get cracking.

The metric for the competition is AUC. Normally, random predictions result in AUC of 0.5. The current leader scores roughly 0.55, which suggests that the stocks are a hard problem indeed, as our previous investigation indicated.

Well-known, mainstream approaches concentrate on predicting asset volatility instead of prices. Predicting volatility allows one to value options using the famous Black-Scholes formula. No doubt there are other techniques, but for obvious reasons people aren’t very forthcoming with publishing them. One insider look confirms that algorithmic learning works and people make tons of money - until the models stop working.

Numerai’s solution to this problem is to crowdsource the construction of models. All they want is predictions.

We have invented regularization techniques that transform the problem of capital allocation into a binary classification problem. (…) Recently, breakthrough developments in encryption have made it possible to conceal information but also preserve structure. (…) We’re buying, regularizing and encrypting all of the financial data in the world and giving it away for free.

Well, you sure can download the dataset without registering. Still no idea what it represents, but it doesn’t stop you from placing on the leaderboard with a good black-box model. From Richard Craib, the Numerai founder:

I worked at a big fund. They wanted to kill me when I proposed running a Kaggle competition. Then I started learning about encryption and quit to start my own Kaggle inspired hedge fund.

Getting back to the comparisons with Kaggle, there are a few more differences about the logistics. More people get the money - the whole top 10. Also, **the payouts will be recurring**. This is good news: if you find yourself near the top of the leaderboard and stay there, the rewards will keep flowing. We hear that they might increase if the Numerai hedge fund goes up.

Let’s dive in, then. We have prepared a few Python scripts that will get you started with validation and prediction.

**UPDATE**: Logistic regression code for March 2016 data.

As we mentioned, each example has a validation flag, because even though the points look independent, the underlying data has a time dimension. The split is set up so that you don’t use data “from the future” in training.

```
d = pd.read_csv( 'numerai_training_data.csv' )
# indices of validation examples
iv = d.validation == 1
val = d[iv].copy()
train = d[~iv].copy()
# no need for the column anymore
train.drop( 'validation', axis = 1 , inplace = True )
val.drop( 'validation', axis = 1 , inplace = True )
```

In our experiments we found that cross-validation produces scores very similar to the predefined split, so you don’t have to stick with it.

The next thing to do is encoding the categorical variable. Let’s take a look.

```
In [5]: data.groupby( 'c1' )['c1'].count()
Out[5]:
c1
c1_1 1356
c1_10 3358
c1_11 2339
c1_12 367
c1_13 74
c1_14 5130
c1_15 3180
c1_16 2335
c1_17 1501
c1_18 1552
c1_19 1465
c1_20 2944
c1_21 1671
c1_22 1858
c1_23 2373
c1_24 2236
c1_3 10088
c1_4 2180
c1_5 2640
c1_6 1112
c1_7 1111
c1_8 3182
c1_9 986
Name: c1, dtype: int64
```

We replace the original feature with dummy (indicator) columns:

```
train_dummies = pd.get_dummies( train.c1 )
train_num = pd.concat(( train.drop( 'c1', axis = 1 ), train_dummies ), axis = 1 )
val_dummies = pd.get_dummies( val.c1 )
val_num = pd.concat(( val.drop( 'c1', axis = 1 ), val_dummies ), axis = 1 )
```

Of course it doesn’t hurt to check if the set of unique values is the same in the train and validation sets:

```
assert( set( train.c1.unique()) == set( val.c1.unique()))
```

If it weren’t, we could create dummies before splitting the sets.
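Here is a minimal sketch of that alternative, using a tiny made-up frame in place of the real training file (the values and the `f1` column are assumptions for illustration). Creating dummies first guarantees that both parts end up with the same columns, even when a category appears in only one of them.

```python
import pandas as pd

# toy stand-in for numerai_training_data.csv;
# note that 'c1_13' occurs only among the validation examples
d = pd.DataFrame({
    'c1': ['c1_1', 'c1_3', 'c1_13', 'c1_3'],
    'f1': [0.1, 0.4, 0.7, 0.2],
    'validation': [0, 0, 1, 1],
})

# encode the categorical variable on the whole frame...
d_num = pd.concat((d.drop('c1', axis=1), pd.get_dummies(d.c1)), axis=1)

# ...and only then split into train and validation
iv = d_num.validation == 1
train_num = d_num[~iv].drop('validation', axis=1)
val_num = d_num[iv].drop('validation', axis=1)

print(train_num.columns.tolist())
```

The train part now has an all-zero `c1_13` column, which is harmless for most models and keeps the two matrices aligned.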

And we’re done with pre-processing. At least when using trees, which don’t care about column means and variances. For other supervised methods, especially neural networks, we’d probably want to standardize - see the appendix below.

Training a random forest with 1000 trees results in validation AUC of roughly 52%. On the leaderboard, it becomes 51.8%.

Now you can proceed to stack them models like crazy.

**UPDATE**: This tournament also has a nasty quirk - validation scores didn’t reflect the leaderboard score. It resulted in a major re-shuffle in the final standings. Interestingly, seven of the top-10 contenders stayed in top-10, while the rest tumbled down.

Before and after.

Scikit-learn provides a variety of scalers, a row normalizer and other nifty gimmicks. We’re going to try them out with logistic regression. To avoid writing the same thing many times, we first define a function that takes data as input, trains, predicts, evaluates, and returns scores:

```
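# aliases assumed from the names used in the function below
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import accuracy_score as accuracy, roc_auc_score as AUC

def train_and_evaluate( y_train, x_train, y_val, x_val ):
    # fit logistic regression, then score it on the validation set
    lr = LR()
    lr.fit( x_train, y_train )
    p = lr.predict_proba( x_val )    # class probabilities, for AUC
    p_bin = lr.predict( x_val )      # hard labels, for accuracy
    acc = accuracy( y_val, p_bin )
    auc = AUC( y_val, p[:,1] )
    return ( auc, acc )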
```

Then it’s time for…

We create a wrapper around `train_and_evaluate()` that transforms X’s before proceeding. This time we use global data to avoid passing it as arguments each time:

```
def transform_train_and_evaluate( transformer ):
    global x_train, x_val, y_train, y_val
    # fit the transformer on the training data only, then apply it to validation data
    x_train_new = transformer.fit_transform( x_train )
    x_val_new = transformer.transform( x_val )
    return train_and_evaluate( y_train, x_train_new, y_val, x_val_new )
```

Now let’s iterate over transformers:

```
from sklearn.preprocessing import ( MaxAbsScaler, MinMaxScaler, RobustScaler,
    StandardScaler, Normalizer, PolynomialFeatures )

transformers = [ MaxAbsScaler(), MinMaxScaler(), RobustScaler(), StandardScaler(),
    Normalizer( norm = 'l1' ), Normalizer( norm = 'l2' ), Normalizer( norm = 'max' ),
    PolynomialFeatures() ]

for transformer in transformers:
    print( transformer )
    auc, acc = transform_train_and_evaluate( transformer )
    print( "AUC: {:.2%}, accuracy: {:.2%} \n".format( auc, acc ))
```

We can also combine transformers using Pipeline, for example create quadratic features and only then scale:

```
from sklearn.pipeline import Pipeline

poly_scaled = Pipeline([( 'poly', PolynomialFeatures()), ( 'scaler', MinMaxScaler())])
transformers.append( poly_scaled )
```

The output:

```
No transformation
AUC: 52.67%, accuracy: 52.74%
MaxAbsScaler(copy=True)
AUC: 53.52%, accuracy: 52.46%
MinMaxScaler(copy=True, feature_range=(0, 1))
AUC: 53.52%, accuracy: 52.48%
RobustScaler(copy=True, with_centering=True, with_scaling=True)
AUC: 53.52%, accuracy: 52.45%
StandardScaler(copy=True, with_mean=True, with_std=True)
AUC: 53.52%, accuracy: 52.42%
Normalizer(copy=True, norm='l1')
AUC: 53.16%, accuracy: 53.19%
Normalizer(copy=True, norm='l2')
AUC: 52.92%, accuracy: 53.20%
Normalizer(copy=True, norm='max')
AUC: 53.02%, accuracy: 52.66%
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
AUC: 53.25%, accuracy: 52.61%
Pipeline(steps=[
('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])
AUC: 53.62%, accuracy: 53.04%
```

It appears that all the pre-processing methods boost AUC, at least in validation. The code is available on GitHub.


The first thing to realize about TensorFlow is that it’s a low-level library, meaning you’ll be multiplying matrices and vectors. Tensors, if you will. In this respect, it’s very much like Theano.

For those preferring a higher level of abstraction, Keras now works with either Theano or TensorFlow as a backend, so you can compare them directly. Is TF any better than Theano? The annoying compilation step is not as pronounced. Other than that, it’s mostly a matter of taste.

**UPDATE**: Google released Pretty Tensor, a higher-level wrapper for TF, and skflow, a simplified interface mimicking scikit-learn.

Now for the elephant in the room… Soumith’s benchmarks suggest that TensorFlow is rather slow.

And this:

@kastnerkyle any idea how to make it run faster or optimise grads? for pure cpu, the javascript version of mdn runs faster than tensorflow!

— hardmaru (@hardmaru) November 26, 2015

MDN in the tweet stands for mixture density networks.

And Alex Smola’s numbers posted by Xavier Amatriain:

As you can see, at the moment TensorFlow doesn’t look too good compared to popular alternatives.

What’s really interesting about the library is its purported ability to use multiple machines. Unfortunately, they haven’t released this part yet. No wonder, distributed is hard. Jeff Dean explains that it’s too intertwined with Google’s internal infrastructure, and says *distributed support is one of the top features they’re prioritizing*.

If you want software that is faster and works in distributed setting *now*, check out MXNet. As a bonus, it has interfaces for other languages, including R and Julia. The people behind MXNet have experience with neural networks, distributed backends, and have written XGBoost, probably the most popular tool among Kagglers.

To sum up, Google released a solid - but hardly outstanding - library that captured a disproportionately large piece of mindshare. Good for them, fresh hires won’t have to learn a new API.

For more, see the Indico machine learning team’s take on TensorFlow, and maybe TensorFlow Disappoints.

Pandas provides functionality similar to R’s data frame. Data frames are containers for tabular data, including both numbers and strings. Unfortunately, the library is pretty complicated and unintuitive. It’s the kind of software you constantly find yourself referring to Stack Overflow with. Therefore it would be nice to have a mental model of how it works and what to expect of it.

We discovered this model listening to a talk by Wes McKinney, the creator of Pandas. He said that the library started as a **replacement for doing analytics in SQL**.

SQL is a language used for moving data in and out of relational databases such as MySQL, Oracle, PostgreSQL, SQLite etc. It has a strong theoretical base called *relational algebra* and is pretty easy to read and write once you get the hang of it.

These days you don’t hear that much about SQL because proven, reliable and mature technology is not news material. SQL is like Unix: it’s a backbone in its domain.

Want to try SQL? One of the easiest ways is to use SQLite. Contrary to other databases, it doesn’t require installation and uses flat files as storage. You’d suspect it’s much slower or primitive, but no. In fact, it’s one of the finest pieces of software we have seen. It’s small, fast and reliable, and has an extensive test suite. The users include Airbus, Apple, Bosch and a number of other well-known companies.

Pandas can read and write to and from databases, so we can create a database in a few lines of code:

```
import pandas as pd
import sqlite3
train_file = 'data/train.csv'
db_file = 'data/sales.sqlite'
train = pd.read_csv( train_file )
conn = sqlite3.connect( db_file )
train.to_sql( 'train', conn, index = False, if_exists = 'replace' )
```

After that, you can use Pandas or one of the available managers to connect to the database and execute queries.

Pandas’ SQL heritage shows, once you know what to look for. If you’re familiar with SQL, it makes using Pandas easier. We’ll show some operations that could be done with either. For demonstration, we’ll use the data from the Rossmann Store Sales competition.

More often than not, Kaggle competitions have quirks. For example: dirty data, a strange evaluation metric, test set markedly different from a train set, difficulty in constructing a validation set, label leakage, few data points, and so on.

One might consider some of these interesting or spicy; the others just spoil the fun. The fact is, rarely do you come across a contest where you can just go and apply some supervised methods off the bat, without wrestling with superfluous problems. The Rossmann competition is a rare exception.

The point is to predict sales in about a thousand stores across Germany. There are roughly a million training points and 40k testing points. The training set spans January 2013 through June 2015, and the test set the next three months in 2015.

The data is clean and nice, prepared with German solidity. There are some non-standard things to consider, but they are mostly of the good kind.

We’re dealing with a time dimension. This is a very common problem. How to address it? One can pretend the data is static and use feature engineering to account for time. As the features we could have, for example, binary indicators for the day of week (already provided), the month, perhaps the year. Another possibility is to come up with a model inherently capable of dealing with time series.

The evaluation metric is RMSE, but computed on relative error. For example, when you predict zero for non-zero sales, the error value is one. When you predict twice the actual sales, the error will also be one. In effect, this metric doesn’t favour big stores over small ones, as raw RMSE would.
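A minimal sketch of such a metric, usually called RMSPE (root mean squared percentage error), follows. Skipping examples with zero actual sales is our assumption here: the relative error is undefined for them, so they have to be handled somehow.

```python
import math


def rmspe(actual, predicted):
    """Root mean squared relative error, ignoring examples with zero actuals."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, predicted) if a != 0]
    return math.sqrt(sum(terms) / len(terms))


print(rmspe([100], [0]))      # predicting zero for non-zero sales
print(rmspe([100], [200]))    # predicting twice the actual sales

# a 10% miss costs the same for a small store and for a big one
print(rmspe([100, 10000], [110, 11000]))
```

Both of the first two calls give an error of one, matching the description above, and the last call shows that the size of the store does not matter, only the relative miss.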

Shop ID is a categorical variable, and relatively high-dimensional. This might make using our first choice, tree ensembles, difficult. Solution? Employ a method able to deal both with high dimensionality and feature interactions (because we need them). Factorization machines is one such method. Another option: transform data to a lower-dim representation.

Besides the usual prizes for the 1st, 2nd and 3rd place, there’s an additional prize for the team whose methodology Rossmann will choose to implement. We consider this a very welcome improvement, as it addresses some issues with Kaggle we raised a while ago.

Let’s look at a histogram of sales, excluding zeros:

`train.loc[train.Sales > 0, 'Sales'].hist( bins = 30 )`

Should we need normality, we can apply the log-transform:

`np.log( train.loc[train.Sales > 0, 'Sales'] ).hist( bins = 20 )`

The benchmark for the competition predicts sales for any given store as a median of sales from all stores on the same day of the week. This means we GROUP BY the day of the week:

```
medians_by_day = train.groupby( ['DayOfWeek'] )['Sales'].median()
```

The result:

```
In [1]: medians_by_day
Out[1]:
DayOfWeek
1 7310
2 6463
3 6133
4 6020
5 6434
6 5410
7 0
Name: Sales, dtype: int64
```

Here’s the same thing in SQL:

```
SELECT DayOfWeek, MEDIAN( Sales ) FROM train GROUP BY DayOfWeek
```

We prefer the median over the mean because of the metric. Unfortunately, MEDIAN function seems to be missing from the popular databases, so we have to stick with the mean for the purpose of this demonstration:

```
SELECT DayOfWeek, AVG( Sales ) FROM train GROUP BY DayOfWeek
```
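If you do want the median in SQLite, one workaround is to register a user-defined aggregate from Python with `sqlite3`’s `create_aggregate()`. The sketch below uses an in-memory database with a few made-up rows; the `Median` class is our own helper, not something SQLite or Pandas ships with.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: collect the values, return their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:          # skip NULLs, like built-in aggregates do
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(':memory:')
conn.create_aggregate('MEDIAN', 1, Median)

# toy table standing in for the real train data
conn.execute('CREATE TABLE train ( DayOfWeek INTEGER, Sales INTEGER )')
conn.executemany('INSERT INTO train VALUES ( ?, ? )',
                 [(1, 10), (1, 20), (1, 900), (2, 5), (2, 7)])

query = '''SELECT DayOfWeek, MEDIAN( Sales ) FROM train
    GROUP BY DayOfWeek ORDER BY DayOfWeek'''
rows = conn.execute(query).fetchall()
print(rows)
```

The aggregate keeps all values of a group in memory, so this is a convenience for exploration rather than something to rely on for very large tables.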

By convention, we use uppercase for SQL keywords, even though they are case-insensitive. We’re selecting from a table called *train*, so we’d also have *test*, just like train and test files. Since both contain data with the same structure, in real life they would probably be in one table, but let’s play with two for the sake of analogy.

The organizers decided to leave in the days where stores were closed, probably to keep the dates continuous. The sales for these days were zero. Currently we’re not taking this into account, but we probably should:

```
medians_by_day_open = train.groupby( ['DayOfWeek', 'Open'] )['Sales'].median()

In [3]: medians_by_day_open
Out[3]:
DayOfWeek  Open
1          0          0
           1       7539
2          0          0
           1       6502
3          0          0
           1       6210
4          0          0
           1       6246
5          0          0
           1       6580
6          0          0
           1       5425
7          0          0
           1       6876
Name: Sales, dtype: int64
```

By the way, the returned *medians_by_day_open* is a series:

```
In [4]: type( medians_by_day_open )
Out[4]: pandas.core.series.Series
```

We’d get the median for Tuesday/Open the following way:

```
In [5]: medians_by_day_open[2][1]
Out[5]: 6502
```

Note how these numbers are bigger than medians by day only. We get a better estimate by excluding “closed” days, and the easiest way to do this is removing them from the train set:

```
train = train.loc[train.Sales > 0]
```

In SQL:

```
DELETE FROM train WHERE Sales = 0
```

Just for completeness, there seem to be a few days where a store was open but no sales occurred:

```
In [6]: len( train[( train.Open ) & ( train.Sales == 0 )] )
Out[6]: 54
```

The obvious way to improve on the benchmark is to group not only by day of week, but also by a store. Running the query in Pandas:

```
query = '''SELECT DayOfWeek, Store, AVG( Sales ) AS AvgSales FROM train
GROUP BY DayOfWeek, Store'''
res = pd.read_sql( query, conn )
res.head()
Out[2]:
DayOfWeek Store AvgSales
0 1 1 4946.119403
1 1 2 5790.522388
2 1 3 7965.029851
3 1 4 10365.686567
4 1 5 5834.880597
```

Note that *AVG( Sales )* is not a very good name for a column, so we give it an alias using AS.
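The query above assumes that `conn` is an open database connection with a *train* table in it. Here is a minimal sketch of one way to set that up, assuming an in-memory SQLite database and a tiny made-up frame in place of the real training data:

```python
import sqlite3
import pandas as pd

# Made-up stand-in for the real training frame.
train = pd.DataFrame({
    "DayOfWeek": [1, 1, 1, 2],
    "Store": [1, 1, 2, 1],
    "Sales": [4000, 6000, 5000, 3000],
})

conn = sqlite3.connect(":memory:")
train.to_sql("train", conn, index=False)  # copy the frame into a table called train

query = """SELECT DayOfWeek, Store, AVG( Sales ) AS AvgSales FROM train
GROUP BY DayOfWeek, Store"""
res = pd.read_sql(query, conn)
print(res)
```

With a real database you would connect to it instead of loading the frame with `to_sql()`.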

Even better, we can include other fields, for example *Promo*. We have enough data to get medians for every possible combination of these three variables.

```
medians = train.groupby( ['DayOfWeek', 'Store', 'Promo'] )['Sales'].median()
medians = medians.reset_index()
```

`reset_index()` converts *medians* from a series to a data frame.

In SQL, we would store the computed means for further use in a so-called view. A view is like a virtual table: it shows results from a SELECT query.

```
CREATE VIEW means AS
SELECT DayOfWeek, Store, Promo, AVG( Sales ) AS AvgSales FROM train
GROUP BY DayOfWeek, Store, Promo
```

Data for machine learning quite often comes from relational databases. The tools typically expect 2D data: a table or a matrix. On the other hand, in a database you may have it spread over multiple tables, because it’s more natural to store information that way. For example, there may be one table called *sales* which contains store ID, date and sales on that day. Another table, *stores* would contain store data.

If you’d like to use stores info for predicting sales, you need to merge those two tables so that every sale row contains info about the relevant store. Note that the same piece of store data will repeat across many rows. That’s why it’s in a separate table in the first place.

For now, we have the medians/means and want to produce predictions for the test set. Accordingly, there are two data frames, or tables: test and medians/means. For each row in test, we’d like to pull an appropriate median. This could be done with a loop, or with an `apply()` function, but there’s a better way.

```
test2 = pd.merge( test, medians, on = ['DayOfWeek', 'Store', 'Promo'], how = 'left' )
```

This will take the Sales column from medians and put it in test so that Sales match DayOfWeek, Store and Promo in test. The operation is known as a JOIN.

```
SELECT test.*, means.AvgSales AS Sales FROM test LEFT JOIN means ON (
test.DayOfWeek = means.DayOfWeek
AND test.Store = means.Store
AND test.Promo = means.Promo )
```

You see that SQL can be a bit verbose.

LEFT JOIN means that we treat the left table (test) as primary: if there’s a row in the left table but no corresponding row to join in the right table, we still keep the row in the results. Sales will be NULL in that case. Contrast this with the default INNER JOIN, where we would discard the row. We want to keep all rows in test, that’s why we use a left join. The resulting frame should have as many rows as the original test:

```
assert( len( test2 ) == len( test ))
```
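To see the difference between the two kinds of join, here is a small sketch with made-up frames. The third test row has no matching combination in *medians*, so the left join keeps it with a missing value while the inner join drops it.

```python
import pandas as pd

# Made-up data: three test rows, medians known for only two combinations.
test = pd.DataFrame({
    "Id": [1, 2, 3],
    "DayOfWeek": [1, 1, 2],
    "Store": [1, 2, 1],
    "Promo": [0, 1, 0],
})
medians = pd.DataFrame({
    "DayOfWeek": [1, 1],
    "Store": [1, 2],
    "Promo": [0, 1],
    "Sales": [5000.0, 6000.0],
})

keys = ["DayOfWeek", "Store", "Promo"]
left = pd.merge(test, medians, on=keys, how="left")
inner = pd.merge(test, medians, on=keys, how="inner")

print(len(left), len(inner))         # 3 2
print(left["Sales"].isnull().sum())  # 1: the unmatched row has no Sales value
```

The row count check is also a cheap guard against duplicate key combinations in the right table, which would multiply rows in the result.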

All that is left is saving the predictions to a file.

```
test2[[ 'Id', 'Sales' ]].to_csv( output_file, index = False )
```

The benchmark scores about 0.19, our solution 0.14, the leaders at the time of writing 0.10.

The script is available on GitHub and at Kaggle Scripts. Did you know that you can run your scripts on Kaggle servers? There are some catches, however. One, the script gets released under the Apache license and you can’t delete it. Two, it can run for at most 20 minutes.

If you want more on the subject matter, Greg Reda has an article (and a talk) on translating SQL to Pandas, as well as a general Pandas tutorial. This Pandas + SQLite tutorial digs deeper.


The most popular option *[in Bayesian inference]*, however, is to drown our sorrows in alcohol, get punch drunk, and stumble around all night. The technical term for this is Markov chain Monte Carlo, or MCMC for short. The “Monte Carlo” part is because the method involves chance, like a visit to the eponymous casino, and the “Markov chain” part is because it involves taking a sequence of steps, each of which depends only on the previous one. The idea in MCMC is to do a random walk, like the proverbial drunkard, jumping from state to state of the network in such a way that, in the long run, the number of times each state is visited is proportional to its probability. We can then estimate the probability of a burglary, say, as the fraction of times we visited a state where there was a burglary.

A “well-behaved” Markov chain converges to a stable distribution, so after a while it always gives approximately the same answers. For example, when you shuffle a deck of cards, after a while all card orders are equally likely, no matter the initial order; so you know that if there are n possible orders, the probability of each one is 1/n. The trick in MCMC is to design a Markov chain that converges to the distribution of our Bayesian network. One easy option is to repeatedly cycle through the variables, sampling each one according to its conditional probability given the state of its neighbors. People often talk about MCMC as a kind of simulation, but it’s not: the Markov chain does not simulate any real process; rather, we concocted it to efficiently generate samples from a Bayesian network, which is itself not a sequential model.
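The “cycle through the variables” recipe is commonly known as Gibbs sampling. Below is a toy sketch of it, not taken from the book: the joint distribution over two binary variables is made up, and the chain simply counts how often it visits each state.

```python
import random

# Made-up joint distribution over two binary variables (A, B).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_a_given_b(b, rng):
    """Draw A from its conditional distribution given the current B."""
    p1 = joint[(1, b)] / (joint[(0, b)] + joint[(1, b)])
    return 1 if rng.random() < p1 else 0

def sample_b_given_a(a, rng):
    """Draw B from its conditional distribution given the current A."""
    p1 = joint[(a, 1)] / (joint[(a, 0)] + joint[(a, 1)])
    return 1 if rng.random() < p1 else 0

def gibbs(n_steps, seed=0):
    """Cycle through the variables and return the fraction of visits to each state."""
    rng = random.Random(seed)
    a, b = 0, 0
    visits = {state: 0 for state in joint}
    for _ in range(n_steps):
        a = sample_a_given_b(b, rng)
        b = sample_b_given_a(a, rng)
        visits[(a, b)] += 1
    return {state: count / n_steps for state, count in visits.items()}

freq = gibbs(50000)
p_a = freq[(1, 0)] + freq[(1, 1)]
print(p_a)  # should land close to the exact marginal P(A=1) = 0.5
```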

The origins of MCMC go all the way back to the Manhattan Project, when physicists needed to estimate the probability that neutrons would collide with atoms and set off a chain reaction. But in more recent decades, it has sparked such a revolution that it’s often considered one of the most important algorithms of all time. MCMC is good not just for computing probabilities but for integrating any function. Without it, scientists were limited to functions they could integrate analytically, or to well-behaved, low-dimensional integrals they could approximate as a series of trapezoids. With MCMC, they’re free to build complex models, knowing the computer will do the heavy lifting. Bayesians, for one, probably have MCMC to thank for the rising popularity of their methods more than anything else.

On the downside, MCMC is often excruciatingly slow to converge, or fools you by looking like it’s converged when it hasn’t. Real probability distributions are usually very peaked, with vast wastelands of minuscule probability punctuated by sudden Everests. The Markov chain then converges to the nearest peak and stays there, leading to very biased probability estimates. It’s as if the drunkard followed the scent of alcohol to the nearest tavern and stayed there all night, instead of wandering all around the city like we wanted him to. On the other hand, if instead of using a Markov chain we just generated independent samples, like simpler Monte Carlo methods do, we’d have no scent to follow and probably wouldn’t even find that first tavern; it would be like throwing darts at a map of the city, hoping they land smack dab on the pubs.

Inference in Bayesian networks is not limited to computing probabilities. It also includes finding the most probable explanation for the evidence, such as the disease that best explains the symptoms or the words that best explain the sounds Siri heard. This is not the same as just picking the most probable word at each step, because words that are individually likely given their sounds may be unlikely to occur together, as in the “Call the please” example. However, similar kinds of algorithms also work for this task (and they are, in fact, what most speech recognizers use).

Most importantly, inference includes making the best decisions, guided not just by the probabilities of different outcomes but also by the corresponding costs (or utilities, to use the technical term). The cost of ignoring an e-mail from your boss asking you to do something by tomorrow is much greater than the cost of seeing a piece of spam, so often it’s better to let an e-mail through even if it does seem fairly likely to be spam.

*You can read a longer, introductory excerpt in the Salon.*

We approach recommendation as a ranking task, meaning that we’re mainly interested in a relatively few items that we consider most relevant and are going to show to the user. This is known as *top-K* recommendation.

Contrast this with rating prediction, as in the Netflix competition. In 2007, Yehuda Koren - a future winner of the contest - noted that people had doubts about using RMSE as the metric and argued in favor of RMSE, using an ad-hoc ranking measure. Later he did the same thing in the paper titled *Factorization Meets the Neighborhood* [PDF].

There’s only a small step from measuring results with RMSE to optimizing RMSE. In our (very limited) experiments, we found RMSE a poor loss function for ranking. For us, matrix factorization optimized for RMSE did reasonably well when ordering a user’s held-out ratings, but failed completely when choosing recommendations from all the available items.

We think the reason is that the training focused on items with the most ratings, achieving a good fit for those. The items with few ratings don’t mean much in terms of their impact on the loss. As a result, predictions for them will be off, some getting scores much higher, some much lower than actual. The former will show among top recommended items, spoiling the results. Maybe some regularization would help.

In other words, RMSE doesn’t tell a true story, and we need metrics specifically crafted for ranking.

The two most popular ranking metrics are MAP and NDCG. We covered Mean average precision a while ago. NDCG stands for Normalized Discounted Cumulative Gain. The main difference between the two is that MAP assumes binary relevance (an item is either of interest or not), while NDCG allows relevance scores in form of real numbers. The relation is just like with classification and regression.

It is difficult to optimize MAP or NDCG directly, because they are discontinuous and thus non-differentiable. The good news is that Ranking Measures and Loss Functions in Learning to Rank shows that a couple of loss functions used in learning to rank approximate those metrics.

Intimidating as the name might be, the idea behind NDCG is pretty simple. A recommender returns some items and we’d like to compute how good the list is. Each item has a relevance score, usually a non-negative number. That’s *gain*. For items we don’t have user feedback for we usually set the gain to zero.

Now we add up those scores; that’s *cumulative gain*. We’d prefer to see the most relevant items at the top of the list, therefore before summing the scores we divide each by a growing number (usually a logarithm of the item position) - that’s *discounting* - and get a DCG.

DCGs are not directly comparable between users, so we *normalize* them. The worst possible DCG when using non-negative relevance scores is zero. To get the best, we arrange all the items in the test set in the ideal order, take first *K* items and compute DCG for them. Then we divide the raw DCG by this ideal DCG to get NDCG@K, a number between 0 and 1.

You may have noticed that we denote the length of the recommendations list by *K*. It is up to the practitioner to choose this number. You can think of it as an estimate of how many items a user will have attention for, so values like 10 or 50 are common.

Here’s some Python code for computing NDCG, it’s pretty simple.
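A minimal sketch of the computation described above follows. The function names are ours, and the base-2 logarithm of the position plus one is just one common choice of discount.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains, taken in list order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(recommended_gains, all_gains, k):
    """NDCG@K: DCG of the recommended list divided by the ideal DCG.

    recommended_gains: relevance of each recommended item, in ranked order
    all_gains: relevance of every item in the test set, in any order
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items at all, nothing to normalize by
    return dcg_at_k(recommended_gains, k) / ideal

all_gains = [3, 2, 1, 0, 0]
print(ndcg_at_k([3, 2, 1], all_gains, k=3))  # 1.0, the ideal order
print(ndcg_at_k([0, 0, 0], all_gains, k=3))  # 0.0, nothing relevant recommended
reversed_score = ndcg_at_k([1, 2, 3], all_gains, k=3)  # right items, wrong order
```

The reversed list contains the same items as the ideal one but scores below 1, because the most relevant item is discounted the most.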

It is important to note that for our experiments the test set consists of all items outside the train, including those not rated by the user (as mentioned above in the RMSE discussion). Sometimes people restrict test to the set of user’s held-out ratings, so the recommender’s task is reduced to ordering those relatively few items. This is not a realistic scenario.

Now that’s the gist of it; there is an alternative formulation for DCG. You can also use negative relevance scores. In that case, you might compute the worst possible DCG for normalizing (it will be less than zero), or still use zero as the lower bound, depending on the situation.

There are two kinds of feedback: explicit and implicit. Explicit means that users rate items. Implicit feedback, on the other hand, comes from observing user behaviour. Most often it’s binary: a user clicked a link, watched a video, purchased a product.

Less often implicit feedback comes in the form of counts, for example how many times a user listened to a song.

MAP is a metric for binary feedback only, while NDCG can be used in any case where you can assign a relevance score to a recommended item (binary, integer or real).

We can divide users (and items) into two groups: those in the training set and those not. Validation scores for the first group correspond to so-called *weak generalization*, and for the second to *strong generalization*. In case of weak generalization, each user is in the training set. We take some ratings for training and leave the rest for testing. When assessing strong generalization, a user is either in train or test.

We are mainly interested in the strong generalization, because in real life we’re recommending items to users not present in the training set. We could deal with this by re-training the model, but this is infeasible for real-time recommendations (unless our model happens to use online learning, meaning that it could be updated with new data as it comes). Our working assumption will be using a pre-trained model without updates, so we need a way to account for previously unseen users.

Some algorithms are better suited to this scenario, some worse. For example, people might say that matrix factorization models are unable to provide recommendations for new users. This is not quite true. Take alternating least squares (ALS), for example. This method fits the model by keeping user factors fixed while adjusting item factors, and then keeping item factors fixed while adjusting user factors. This goes on until convergence. At test time, when we have input from a new user, we could keep the item factors fixed and fit the user factors, then proceed with recommendations.

In general, when a predicted rating is a dot product between user and item factors, we can take item factors and solve a system of linear equations to estimate user factors. This amounts to fitting a linear regression model. We’d prefer the number of ratings (examples) be greater than the number of factors, but even when it’s not there is hope, thanks to regularization.
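As a sketch of the idea, estimating factors for a new user can be written as a small ridge regression. Everything below (the function name, the random item factors, the choice of regularization strength) is made up for illustration and not tied to any particular library’s ALS.

```python
import numpy as np

def fit_user_factors(item_factors, ratings, reg=0.1):
    """Solve (V^T V + reg * I) u = V^T r for the user vector u.

    item_factors: matrix V with one row per rated item
    ratings: vector r with the user's ratings for those items
    """
    V = np.asarray(item_factors, dtype=float)
    r = np.asarray(ratings, dtype=float)
    A = V.T @ V + reg * np.eye(V.shape[1])
    return np.linalg.solve(A, V.T @ r)

rng = np.random.default_rng(0)
V_all = rng.normal(size=(100, 5))  # pretend these are learned factors for 100 items
true_u = rng.normal(size=5)        # the unknown taste of a new user

rated = np.arange(30)              # the new user has rated the first 30 items
ratings = V_all[rated] @ true_u    # noiseless ratings, for the sake of the demo

u = fit_user_factors(V_all[rated], ratings, reg=1e-8)
scores = V_all @ u                 # predicted ratings for every item
top_10 = np.argsort(-scores)[:10]  # indices of the items we would recommend
```

With fewer ratings than factors the plain system would be underdetermined. The `reg` term is what keeps it solvable, at the price of shrinking the estimate towards zero.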

Lack of examples is known as a *cold start* problem: a new visitor has no ratings, so collaborative filtering is of no use for recommendation. Only after we have some feedback can we begin to work with that.

Normally a recommender will perform better with more information - ideally the quality of recommendations should improve as a system sees more ratings from a given user. When evaluating a recommender we’d like to take this dimension into account.

To do so, we repeatedly compute recommendations and NDCG for a given user with one rating in train and the rest in test, with two ratings in train and the rest in test, and so on, up to a number (which we’ll call *L*) or until there are no more ratings in test. Then we plot the results.

On the X axis, the number of ratings in train (L). On the Y axis, mean NDCG@50 across users.

When comparing results from two recommenders, the plot will reveal the difference between them. Either one is better than the other across the board, or at some point the curves intersect.

The intersection offers a possibility of using the combination of the two systems. Initially we employ the first; after acquiring more feedback than the threshold, we switch to the other. Here, blue is better when given a few ratings, but around 50 it levels off. Green gains an upper hand when provided with more ratings.

The scores were computed on a test set consisting of roughly 1000 users - this sample size provides discernible shapes, but still some noise, as you can see from jagged lines.

Should we require a number instead of a plot, we can average the scores across number of ratings available for training. The resulting metric is MANDCG: Mean (between users) Average (between 1…L) NDCG. You can think of it as being proportional to the area under the curve on the plot.
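The averaging itself takes a few lines. In this sketch each user contributes a list of NDCG scores, one per number of ratings in train; averaging only over the points a user actually has is our assumption about how to treat users who run out of ratings before reaching *L*.

```python
def mandcg(ndcg_curves):
    """Mean over users of the average NDCG across training-set sizes 1..L.

    ndcg_curves: one list per user; element i is that user's NDCG with
    i + 1 ratings in train. Lists may be shorter than L.
    """
    per_user = [sum(curve) / len(curve) for curve in ndcg_curves if curve]
    return sum(per_user) / len(per_user)

# Made-up scores for two users.
curves = [
    [0.1, 0.2, 0.3],  # a user with three evaluation points
    [0.4, 0.6],       # a user who ran out of ratings earlier
]
print(round(mandcg(curves), 3))  # 0.35, the mean of 0.2 and 0.5
```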

Code for this article is available on GitHub. To run it, you’ll need to supply and plug in your recommender.


It all started with Andrej Karpathy’s blog post on recurrent neural networks generating text, character by character. This is by no means a new idea - it goes back to a 2011 paper by Sutskever, Martens and Hinton on *Generating Text with Recurrent Neural Networks*. See Ilya Sutskever’s page for a PDF, video talk and code. He even set up an online demo, although it’s not very impressive by today’s standards.

Anyway, Andrej has written a very lucid explanation, provided a few examples, and posted his char-rnn Torch code on GitHub. Samim employed it to create synthetic Obama speeches and TED talks.

**UPDATE**: See a couple of RNN-generated TED talks. The clip starts with a demonic synthetic voice and Juergen Schmidhuber in the video. RNN decides when the audience laughs.

Other people used the network to compose Mozart style music and Irish folk music. The folk music gets the rhythm and harmony straight and, to an unsuspecting listener, might well pass for human work.

Note that these attempts worked on musical symbols. Another way is to feed raw audio to a recurrent net and some Stanford students did just that. You’ll find the paper, *GRUV: Algorithmic Music Generation using Recurrent Neural Networks*, among CS224d class reports, and an accompanying video on YouTube. We found the output akin to an old, half-tuned-in radio skipping between stations - but the music sounds very real.

All of the mentioned endeavours involve recurrent neural networks (mostly their variants with better memory - LSTM and GRU), which are particularly well suited for modelling sequences. Now, let’s turn to convnets, which are good for…

So far, the most common task for CNNs was object recognition in images. In June, Google made a big splash by showing a few pictures created by an unspecified neural network (details are even scarcer than usual - no paper so far, let alone any code or live demo). The authors call the technique inceptionism. The images share a somewhat nightmarish quality - one commenter had this to say:

It seemed to be a sort of monster, or symbol representing a monster, of a form which only a diseased fancy could conceive. If I say that my somewhat extravagant imagination yielded simultaneous pictures of an octopus, a dragon, and a human caricature, I shall not be unfaithful to the spirit of the thing. A pulpy, tentacled head surmounted a grotesque and scaly body with rudimentary wings; but it was the general outline of the whole which made it most shockingly frightful.

Now take a deep breath and look at it:

The monster has become known as “puppyslug”. But don’t be scared, some creations are way prettier:

More at the inceptionism gallery. Also see this short video.

**UPDATE**: Google has released its inceptionism code under the name deepdream. Simultaneously with Google, J.C. Johnson from Stanford made available his implementation of inceptionism, cnn-vis.

*Deepdream* can produce a few distinct styles of images. What you see most often among people’s creations is dogs, due to a large number of said animals in ImageNet. If you want more variety, see the section on the LSD network below.

The community has embraced *deepdream* wholeheartedly. There are:

- deepdream Reddit thread
- dreamdeeply.com - a site where you can upload your image
- bat-country - a pip-installable version (a few nice examples)
- clouddream - a dockerized version
- DeepDream Animator for making videos

Ayahuasca girls. Credits: reddit; original image

There is a practical side to hallucinating views: one can take a few images and interpolate between them, creating a video. This is described in the DeepStereo: Learning to Predict New Views from the World’s Imagery paper and the YouTube clip shows how it works. While not as entertaining as the inceptionism art, imagine how this method could improve Street View in Google Maps.

Other researchers didn’t stay behind. Soumith Chintala from Facebook came up with a method of creating realistic-looking images. It’s called Eyescream and the Torch code is on GitHub.

The project we find most impressive is the Large Scale Deep Neural Network, created by Jonas Degrave, Sander Dieleman and friends. It’s a convnet running in reverse: instead of producing labels from images, it makes images from labels. This is different from inceptionism, where they use normal pictures as input to the net, which only modifies them.

**You owe it to yourself to watch LSD NN on Twitch**. People suggest classes from ImageNet in the chat and the net dreams about them. Even better, you can suggest a combination of two categories, which makes for some stunning visuals.

**UPDATE**: The authors have released Theano/Lasagne-based code.

If all this whetted your appetite for abstract eye candy, take a look at electric sheep. They are fractal videos, a bit similar to the LSD network output, and can run as a screensaver. The generating engine is called FLAM.

Interestingly, the sheep are bred to be attractive. Viewers indicate which sheep they like and this input is fed to a genetic algorithm, aiming to create even better offspring. Here are some screenshots; if you are a returning reader, you may recognize one or two images:

Google has more.

Kaggle has a tutorial for this contest which takes you through the popular bag-of-words approach, and a take at *word2vec*. The tutorial hardly represents best practices, most certainly to let the competitors improve on it easily. And that’s what we’ll do.

Validation is a cornerstone of machine learning. That’s because we’re after generalization to the unknown test examples. Usually the only sensible way to assess how a model generalizes is by using validation: either a single training/validation split if you have enough examples, or cross-validation, which is more computationally expensive but a necessity if you have few training points.

A sidenote: in quite a few Kaggle competitions a test set comes from a different distribution than a training set, meaning it’s hard to even make a representative validation set. That’s either a challenge or stupidity, depending on your point of view.

To motivate the need for validation let’s inspect a case of the Baidu team taking part in the ImageNet competition. These guys apparently didn’t know about validation, so they had to resort to evaluating their efforts using the leaderboard. You only get two submissions a week with ImageNet, so they created a number of fake accounts to broaden their bandwidth. Unfortunately, the organizers didn’t like it and it resulted in an embarrassment for Baidu.

Our first step is to modify the original tutorial code by enabling validation. Therefore we need to split the training set. Since we have 25k training examples, we will take 5k for testing and leave 20k for training. One way is to split a training file into two - we used the `split.py` script from phraug2:

```
python split.py train.csv train_v.csv test_v.csv -p 0.8 -r dupa
```

Using a random seed “dupa” for reproducibility. *Dupa* is a Polish codeword for occasions like this. Results we report below are based on this split.

The training set is rather small, so another way is to load the whole training file into memory and split it then, using fine tools that *scikit-learn* provides exactly for this type of thing:

```
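# sklearn.cross_validation was removed in later scikit-learn releases;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split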
train, test = train_test_split( data, train_size = 0.8, random_state = 44 )
```

The scripts we provide use this mechanism instead of separate train/test files, for convenience. We need to use indices because we’re dealing with Pandas frames, not Numpy arrays:

```
all_i = np.arange( len( data ))
train_i, test_i = train_test_split( all_i, train_size = 0.8, random_state = 44 )
# .iloc selects rows by position (the older .ix indexer is gone from current pandas)
train = data.iloc[train_i]
test = data.iloc[test_i]
```

The metric for the competition is AUC, which needs probabilities. For some reason the Kaggle tutorial predicts only zeros and ones. This is easy to fix:

```
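# AUC is assumed here to be an alias for scikit-learn's roc_auc_score
from sklearn.metrics import roc_auc_score as AUC
p = rf.predict_proba( test_x )
auc = AUC( test_y, p[:,1] )  # column 1 holds the probability of the positive class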
```

And we see that random forest scores roughly 91.9%.
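To see why hard zeros and ones hurt, recall that AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it by brute force on a handful of made-up predictions (fine for a toy, too slow for real data) and compares probabilities with their thresholded version.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.8, 0.55, 0.6, 0.3, 0.1]
hard = [1 if p >= 0.5 else 0 for p in proba]  # what predict() would give

auc_proba = auc(y, proba)  # 8 of 9 pairs ordered correctly, about 0.889
auc_hard = auc(y, hard)    # ties cost half a pair each: 7.5 of 9, about 0.833
print(auc_proba, auc_hard)
```

Thresholding throws away the ordering within each predicted class, so every positive that lands in the same bucket as a negative becomes a coin flip.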

Random forest is a very good, robust and versatile method; however, it’s no mystery that for high-dimensional sparse data it’s not the best choice. And the BoW representation is a perfect example of sparse and high-d.

We covered *bag of words* a few times before, for example in A bag of words and a nice little network. In that post, we used a neural network for classification, but the truth is that a linear model in all its glorious simplicity is usually the first choice. We’ll use logistic regression, for now leaving hyperparams at their default values.

Validation AUC for logistic regression is 92.8%, and it trains much faster than a random forest. If you’re going to remember only one thing from this article, remember to use a linear model for sparse high-dimensional data such as text as bag-of-words.

TF-IDF stands for “term frequency / inverse document frequency” and is a method for emphasizing words that occur frequently in a given document, while at the same time de-emphasising words that occur frequently in many documents.
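Here is a bare-bones sketch of the weighting, using the textbook form tf * log(N / df) on three made-up documents. Note that scikit-learn’s `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

docs = [
    "the movie was great great fun",
    "the movie was dull",
    "the plot was thin",
]
w = tfidf(docs)
print(w[0]["the"])                    # 0.0, because "the" occurs in every document
print(w[0]["great"] > w[0]["movie"])  # True: frequent here, rare elsewhere
```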

Our score with TfidfVectorizer and 20k features was 95.6%, a big improvement.

The author of the Kaggle tutorial felt compelled to remove stopwords from reviews. Stopwords are commonly occurring words, like “this”, “that”, “and”, “so”, “on”. Is it a good decision? We don’t know, we need to check, we have the validation set, remember? Leaving stopwords in scores 92.9% (before TF-IDF).

There is one more important reason against removing stopwords: we’d like to try n-grams, and for n-grams we’d better leave all the words in place. We covered n-grams before; they are combinations of *n* sequential words, starting with bigrams (two words): “cat ate”, “ate my”, “my precious”, “precious homework”. Trigrams consist of three words: “cat ate my”, “ate my precious”, “my precious homework”; 4-grams of four, and so on.
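Extracting n-grams from a list of tokens is a one-liner. A quick sketch, with a function name of our choosing:

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, each joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "cat ate my precious homework".split()
print(ngrams(tokens, 2))
# ['cat ate', 'ate my', 'my precious', 'precious homework']
print(ngrams(tokens, 3))
# ['cat ate my', 'ate my precious', 'my precious homework']
```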

Why do n-grams work? Consider this phrase: “movie not good”. It has obviously negative sentiment, however if you take each word in isolation you won’t detect this. On the contrary, the model will probably learn that “good” is a positive sentiment word, which doesn’t help at all here.

On the other hand, bigrams will do the trick: the model will probably learn that “not good” has a negative sentiment.

To use a more complicated example from Stanford’s sentiment analysis page:

This movie was actually neither that funny, nor super witty.

For this, bigrams will fail with “that funny” and “super witty”. We’d need at least trigrams to catch “neither that funny” and “nor super witty”, however these phrases don’t seem to be too common, so if we’re using a restricted number of features, or regularization, they might not make it into the model. Hence the motivation for a more sophisticated model like a neural network, but we digress.

If computing n-grams sounds a little complicated, *scikit-learn* vectorizers can do it automatically. As can Vowpal Wabbit, but we won’t use Vowpal Wabbit here.
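For illustration, this is roughly what the combination looks like in *scikit-learn*: a TF-IDF vectorizer producing unigrams, bigrams and trigrams, followed by logistic regression. The four reviews are made up and far too few to learn anything meaningful, and the parameter values are examples rather than the settings behind the scores reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data; 1 = positive review, 0 = negative review.
reviews = [
    "a good movie with a great cast",
    "really good and very funny",
    "movie not good at all",
    "not funny and not good",
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    # ngram_range=(1, 3) asks for unigrams, bigrams and trigrams;
    # max_features caps the dimensionality.
    TfidfVectorizer(ngram_range=(1, 3), max_features=40000),
    LogisticRegression(),
)
model.fit(reviews, labels)

# Probability of the positive class, which is what AUC needs.
p = model.predict_proba(["movie not good"])[:, 1]
print(p)
```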

The AUC score with tri-grams is 95.9%.

Each word is a feature: whether it’s present in the document or not (0/1), or how many times it appears (an integer >= 0). We started with the original dimensionality from the tutorial, 5000. This makes sense for a random forest, which as a highly non-linear / expressive / high-variance classifier needs a relatively high ratio of examples to dimensionality. Linear models are less exacting in this respect, they can even work with d >> n.

We found out that if we don’t constrain the dimensionality, we run out of memory, even with such a small dataset. We could afford roughly 40k features on a machine with 12 GB of RAM. More caused swapping.

For starters, we tried 20k features. The logistic regression scores 94.2% (before TF-IDF and n-grams), vs 92.9% with 5k features. More is even better: 96.0 with 30k, 96.3 with 40k (after TF-IDF and ngrams).

To deal with memory issues we could use the hashing vectorizer. However it only scores 93.2% vs 96.3% before, partly because it doesn’t support TF-IDF.
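The memory saving comes from the hashing trick: instead of keeping a vocabulary, each token is hashed straight to a column index. Here is a toy sketch of the idea, with CRC32 standing in for whatever hash function a real implementation uses:

```python
import zlib

def hash_features(tokens, n_features=2 ** 18):
    """Map tokens to a sparse {column index: count} row without storing a vocabulary."""
    row = {}
    for token in tokens:
        # Different tokens can collide on the same index; that is the price of the trick.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        row[index] = row.get(index, 0) + 1
    return row

row = hash_features("the cat ate the homework".split(), n_features=16)
print(row)
```

The number of columns is fixed up front, no matter how many distinct words show up.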

We showed how to improve text classification by:

- making a validation set
- predicting probabilities for AUC
- replacing random forest with a linear model
- weighing words with TF-IDF
- leaving the stopwords in
- adding bigrams or trigrams

The public leaderboard score closely reflects validation score: both are roughly 96.3%. At the time of submission it was good enough for top 20 out of ~500 contenders.

You might remember that we left the logistic regression hyperparams at their default values. Moreover, the vectorizer has its own params, actually more than you would expect. Tweaking both results in a modest improvement, to 96.6%.

Again, code for this article is available on GitHub.

**UPDATE**: Mesnil, Mikolov, Ranzato and Bengio have a paper on sentiment classification: Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews (code). They found that a linear model using n-grams outperformed both a recurrent neural network and a linear model using sentence vectors.

However, the dataset they use, the Stanford Large Movie Review Dataset, is small - it has 25000 training examples. Alec Radford says that RNNs start to outperform linear models when the number of examples is larger, roughly from 100k to 1M.

Credit: Alec Radford / Indico, Passage example

As to sentence vectors, the authors use them with logistic regression. We’d rather see the 100-d vectors fed to a non-linear model like random forest.

Having done that, we humbly discovered that random forest scores just 85-86% (strange… why?), depending on the number of trees. Logistic regression yields roughly 89% accuracy, exactly as reported in the paper.

By the way, the version of word2vec supplied in the repository can train sentence vectors (presumably the same as paragraph vectors, since the author of both is Tomas Mikolov).
