Basic seq2seq is an LSTM encoder coupled with an LSTM decoder. It’s most often heard of in the context of machine translation: given a sentence in one language, the encoder turns it into a fixed-size representation. The decoder transforms this back into a sentence, possibly of a different length than the source. For example, “como estas?” - two words - would be translated to “how are you?” - three words.

The model gives better results when augmented with attention. Practically, it means that instead of processing the input from start to finish, the decoder can look back and forth over the input. Specifically, it has access to encoder states from each step, not just the last one. Consider how this may help with Spanish, in which adjectives come after nouns: “neural network” becomes “red neuronal”.

In technical terms, attention (at least this particular kind, content-based attention) boils down to dot products and weighted averages. In short, a weighted average of encoder states becomes the decoder state. Attention is just the distribution of weights.
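Stripped of the LSTMs, those mechanics can be sketched in a few lines of numpy (sizes and values here are made up for illustration, not taken from any particular model):

```python
import numpy as np

def softmax( x ):
    e = np.exp( x - x.max())
    return e / e.sum()

rng = np.random.default_rng( 0 )
encoder_states = rng.normal( size = ( 4, 8 ))  # one 8-dim state per input step
decoder_state = rng.normal( size = 8 )         # current decoder state

scores = encoder_states @ decoder_state        # dot product with each encoder state
weights = softmax( scores )                    # attention distribution over input steps
context = weights @ encoder_states             # weighted average of encoder states
```

The `weights` vector sums to one; the `context` vector is what the decoder actually consumes.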

Here’s more on seq2seq and attention in Keras.

In pointer networks, attention is even simpler: instead of weighing input elements, it points at them probabilistically. In effect, you get a permutation of inputs. Refer to the paper for details and equations.
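In numpy terms, a single decoding step of a pointer network might look like this (the scores are made up; in the real network they come from comparing the decoder state with the encoder states):

```python
import numpy as np

def softmax( x ):
    e = np.exp( x - x.max())
    return e / e.sum()

scores = np.array([ 0.1, 2.3, -0.7, 1.1 ])  # one score per input element
probs = softmax( scores )                   # pointer distribution over the inputs
pointer = int( np.argmax( probs ))          # points at input element 1
```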

Note that one doesn’t need to use all the pointers. For example, given a piece of text, a network could mark an excerpt by pointing at the element where it starts and then at the element where it ends.

Where do we start? Well, how about ordering numbers. In other words, a *deep argsort*:

```
In [3]: np.argsort([ 10, 30, 20 ])
Out[3]: array([0, 2, 1], dtype=int64)
In [4]: np.argsort([ 40, 10, 30, 20 ])
Out[4]: array([1, 3, 2, 0], dtype=int64)
```

Let us dive right in.

Surprisingly, the authors don’t pursue the task in the paper. Instead, they use two fancy problems: traveling salesman and convex hull (see READMEs), admittedly with very good results. Why not sort numbers, though?

It turns out that numbers are hard. They address it in the follow-up paper, Order Matters: Sequence to sequence for sets. The main point is, make no mistake, that order matters. Specifically, we’re talking about the order of the input elements. The authors found out that this order influences results very much, which is not what we want. That’s because **in essence we’re dealing with sets as input, not sequences**. Sets don’t have inherent order, so how elements are permuted ideally shouldn’t affect the outcome.

Hence the paper introduces an improved architecture, replacing the LSTM encoder with a feed-forward network connected to another LSTM. That LSTM is said to run repeatedly in order to produce *an embedding which is permutation invariant to the inputs*. The decoder is the same, a pointer network.

Back to sorting numbers. The longer the sequence, the harder it is to sort. For five numbers, they report an accuracy ranging from 81% to 94%, depending on the model (accuracy here refers to the percentage of correctly sorted sequences). When dealing with 15 numbers, the scores range from 0% to 10%.

In our experiments, we achieved nearly 100% accuracy with 5 numbers. Note that this is “categorical accuracy” as reported by Keras, meaning the percentage of elements in their right places. For example, this prediction would be 50% accurate - the first two elements are in place, but the last two are swapped:

```
4 3 2 1 -> 3 2 0 1
```
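For the record, this is how that accuracy might be computed for the example above:

```python
import numpy as np

y_true = np.array([ 3, 2, 1, 0 ])  # correct pointers for sorting 4 3 2 1
y_pred = np.array([ 3, 2, 0, 1 ])  # first two right, last two swapped
accuracy = ( y_true == y_pred ).mean()  # 0.5
```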

For sequences with eight elements, the categorical accuracy drops to around 33%. We also tried a more challenging task, sorting a set of arrays by their sums:

```
[1 2] [3 4] [2 3] -> 0 2 1
```

The network handles this just as (un)easily as scalar numbers.

One unexpected thing we’ve noticed is that the network tends to duplicate pointers, especially early in training. This is disappointing: apparently it cannot remember what it predicted just a moment ago. “Oh yes, this element is going to be the second, and this next element is going to be the second. The next element, let’s see… It’s going to be the second, and the next…”

```
y_test: [2 0 1 4 3]
p: [2 2 2 2 2]
```

Men gathered to visualize outputs of a pointer network in the early stage of training. No smiles at this point.

Later:

```
y_test: [2 0 1 4 3]
p: [2 0 2 4 3]
```

Also, training sometimes gets stuck at some level of accuracy. And a network trained on small numbers doesn’t generalize to bigger ones, like these:

```
981,66,673
856,10,438
884,808,241
```

To help the network with numbers, we tried adding an ID (1,2,3…) to each element of the sequence. The hypothesis was that since the attention is content-based, maybe it could use positions explicitly encoded in content. This ID is either a number (`train_with_positions.py`) or a one-hot vector (`train_with_positions_categorical.py`). It seems to help a little, but doesn’t remove the fundamental difficulty.

Code for the experiments is available at GitHub. Compared with the original repo, we added a data generation script and changed the training script to load data from generated files. We also changed the optimization algorithm to RMSProp, as it seems to converge reasonably well while handling the learning rate automatically.

We’re representing data with 3D arrays here. The first dimension (rows) is examples, as usual. The second, columns, would normally be features (attributes), but with sequences the features go into the third dimension. The second dimension consists of elements of a given sequence. Below are three example sequences, each with three elements (steps), each with two features:

```
array([[[ 8,  2],
        [ 3,  3],
        [10,  3]],

       [[ 1,  4],
        [19, 12],
        [ 4, 10]],

       [[19,  0],
        [15, 12],
        [ 8,  6]]])
```

The goal would be to sort the elements by the sum of the features, so the corresponding targets would be

```
array([[1, 0, 2],
       [0, 2, 1],
       [2, 0, 1]])
```

And they are encoded categorically:

```
array([[[ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.]],

       [[ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.]],

       [[ 0.,  0.,  1.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]]])
```
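Both the targets and their one-hot encoding can be derived mechanically from the 3D array:

```python
import numpy as np

x = np.array([[[ 8,  2], [ 3,  3], [10,  3]],
              [[ 1,  4], [19, 12], [ 4, 10]],
              [[19,  0], [15, 12], [ 8,  6]]])

sums = x.sum( axis = 2 )                  # sum of features for each element
y = np.argsort( sums, axis = 1 )          # pointers: [[1 0 2], [0 2 1], [2 0 1]]
y_categorical = np.eye( x.shape[1] )[y]   # one-hot encoded targets
```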

One hairy thing here is that we’ve been talking all along about how recurrent networks can handle variable-length sequences, but in practice the data is a 3D array, as seen above. In other words, the sequence length is fixed.

Incredulous cat. What did you think?

The way to deal with that is to fix the dimensionality at the maximum possible sequence length and pad the unused places with zeros.
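A minimal sketch of such padding with numpy (Keras also ships a `pad_sequences` utility that does this):

```python
import numpy as np

seqs = [[ 5, 2 ], [ 7, 1, 9, 4 ], [ 3 ]]
max_len = max( len( s ) for s in seqs )

padded = np.zeros(( len( seqs ), max_len ), dtype = int )
for i, s in enumerate( seqs ):
    padded[i, :len( s )] = s  # the unused tail stays zero
```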

“Great”, you say, “but won’t it mess up the cost function?” It might, so we’d better mask those zeros so they are omitted when calculating the loss. In Keras, the official way to do this seems to be the Embedding layer. The relevant parameter is *mask_zero*:

mask_zero: Whether or not the input value 0 is a special “padding” value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

For more on masking, see Variable Sequence Lengths in TensorFlow.

We have used a Keras implementation of pointer networks. There are a few others on GitHub, mostly in Tensorflow. Depending on how you look at it, that’s slightly crazy, as people build everything from the ground up, while one just needs a slight modification of a normal seq2seq with attention. On the other hand, seq2seq with attention hasn’t yet found its way into the mainstream, or into Keras, the way some other models have, so it’s still blazing trails.

One problem with all of that is you don’t know if an implementation you’re using is correct. Is the network not converging because of the task, the optimization method, or maybe a bug? To be sure, you’d need to read and understand the source code line by line, which is just one step removed from writing it yourself. As the OpenAI blog puts it:

Results are tricky to reproduce: performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don’t report all the required tricks. By releasing known-good implementations (and best practices for creating them), we’d like to ensure that apparent RL advances never are due to comparison with buggy or untuned versions of existing algorithms.

Be wary of non-breaking bugs: when we looked through a sample of ten popular reinforcement learning algorithm reimplementations we noticed that six had subtle bugs found by a community member and confirmed by the author.

Here’s where we are in the grand scheme of things

They’re talking about reinforcement learning, but the quote is widely applicable. Luckily, the official Order Matters implementation will be made available upon the publication of the paper. They promised. In the meantime, we salute you.

- https://github.com/keon/pointer-networks (slides)
- https://github.com/devsisters/pointer-network-tensorflow
- https://github.com/vshallc/PtrNets
- https://github.com/ikostrikov/TensorFlow-Pointer-Networks
- https://github.com/Chanlaw/pointer-networks
- https://github.com/devnag/tensorflow-pointer-networks (article)

But wait, there’s more:

- https://github.com/udibr/pointer-generator
- https://github.com/JerrikEph/SentenceOrdering_PTR
- https://github.com/pradyu1993/seq2set-keras

The data consists of a few years’ worth of measurements from nine regions of England. Our first move was to split the training set into two parts for validation, then run gradient boosting and score the predictions. RMSE 0.29 - excellent; that would be a top score on the leaderboard. The benchmark to beat is linear regression, scoring 0.33.

We had a hunch not to upload these predictions right away. Good hunch, because validation doesn’t quite work in this competition. We learned about this from a thread on the forum. Surfing the forum links, we found out that scikit-learn has functionality for building progressive validation sets for time series. But that’s moot here, since, again, validation doesn’t work.

However, while inspecting the contents of the model selection module for other goodies, we found an interesting piece in the “Model validation” section: permutation test score.

The function randomly permutes the labels, then trains a model and scores its predictions. After doing this enough times you build a sample and get a p-value.
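The gist of it, hand-rolled (scikit-learn’s `permutation_test_score` does essentially this, plus cross-validation; the toy data and score function below are ours):

```python
import numpy as np

def permutation_test( score, x, y, n_permutations = 100, seed = 0 ):
    rng = np.random.RandomState( seed )
    true_score = score( x, y )
    perm_scores = np.array([ score( x, rng.permutation( y )) for _ in range( n_permutations )])
    # how often does a random labeling do at least as well? (+1 smoothing)
    pvalue = ( np.sum( perm_scores >= true_score ) + 1.0 ) / ( n_permutations + 1 )
    return true_score, perm_scores, pvalue

# toy example: y clearly depends on x, score is absolute correlation
x = np.arange( 50.0 )
y = x + np.random.RandomState( 1 ).normal( scale = 5, size = 50 )
score = lambda x, y: abs( np.corrcoef( x, y )[0, 1] )

_, _, pvalue = permutation_test( score, x, y )
# with a dependence this strong, no permutation beats the true labels,
# so pvalue comes out as 1/101, about 0.0099
```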

Initially, we discovered that permuting the labels doesn’t change the validation score. That would suggest that the features are totally uninformative.

The thing is, that experiment was done on the spur of the moment in IPython, and we saved no code. Later we attempted to reproduce the result, but failed. It appears that validation works within the training set after all. We adapted the example from the documentation for our `permutation_test.py` and ran 100 permutations.

The score here is negative MSE (negative because the software maximizes). Random permutations score around 0.29 in terms of RMSE; the true labels score significantly better (p-value 0.0099), around 0.24.

Note that we’ve used cross-validation here, which isn’t good practice for time series. Therefore, we ran the test again with our own split:

```
train_i = d.date.dt.year < 2012
test_i = d.date.dt.year == 2012
# ...
score, permutation_scores, pvalue = permutation_test_score(
    clf, x, y, scoring = "neg_mean_squared_error",
    cv = [[ train_i, test_i ]],
    n_permutations = n_permutations )
```

It turns out that this way we score even better, around 0.21. Notice that the random permutation scores also got a little better.

For good measure, here’s the same setup with the addition of one-hot region features. These features don’t help much, but they don’t spoil the score either.

Alas, validation scores do not translate into leaderboard scores.

After the negative experience with validation, we formed a suspicion that air quality features may not be quite sufficient for the task at hand. After all, maybe people don’t die because today there’s more smog than yesterday, but rather from exposure over the years. This led us to forecast the target as a 1-D time series, without additional features. We used the Prophet package from Facebook.

It works as follows: you give it a dataframe containing the date (*ds*) and target value (*y*) columns to fit, then it predicts *yhat* for a period in the future. The quick start offers a more detailed description of the process. Internally, the library employs two Stan models, which means you need Stan installed (good luck on Windows).

Prophet converged in training and produced good validation scores and nice plots. Unfortunately, as mentioned, validation doesn’t really work in this competition, so we’re left with the plots. As you can see, the prediction plot, although not accurate, looks very reasonable:

The components plot offers better information - there is a downward trend in mortality, and we see that mortality is seasonal. People die mostly in winter (around New Year), and on Mondays and Fridays.

Prophet decided to print the names of the months in Polish, so this is your chance to learn: march, may, july, september (take care to pronounce that little box correctly), november.

This precious knowledge allowed us to improve on the linear benchmark by adding features based on date: month and day of week. Both are one-hot encoded, for a total of 12 + 7 = 19 new binary features.

```
train['day_of_week'] = train.date.dt.dayofweek
train['month'] = train.date.dt.month
test['day_of_week'] = test.date.dt.dayofweek
test['month'] = test.date.dt.month
# assuming that train and test sets have identical sets of values of categorical features
train = pd.get_dummies( train, columns = [ 'day_of_week', 'month' ])
test = pd.get_dummies( test, columns = [ 'day_of_week', 'month' ])
```

We don’t bother with validation this time, just go straight to the leaderboard - and into the top ten.

When not using tree-based models, it usually makes sense to scale the columns. Here, the features have maximum values of a similar magnitude, with the exception of T2M, which is noticeably bigger.

```
O3 96.284
PM10 59.801
PM25 45.846
NO2 76.765
T2M 297.209
```

Scaling the features with scikit-learn’s *MinMaxScaler* improves the public score by a tiny 0.00021. We chose this scaler for no particular reason other than that it leaves the one-hot columns as-is. In our experience, the differences between different scalers are minor and we find zeros and ones more aesthetically pleasing than shifted and standardized values.
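Min-max scaling is simple enough to write by hand, which also shows why one-hot columns survive it unchanged (their min is already 0 and max already 1). A sketch equivalent to *MinMaxScaler*, on made-up values:

```python
import numpy as np

def min_max_scale( x ):
    mn, mx = x.min( axis = 0 ), x.max( axis = 0 )
    return ( x - mn ) / ( mx - mn )

# two "real" columns and a one-hot column
x = np.array([[ 96.284, 297.209, 0.0 ],
              [ 59.801, 250.0,   1.0 ],
              [  0.0,   150.0,   0.0 ]])

scaled = min_max_scale( x )  # every column now spans [0, 1]
```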

Let’s inspect the feature weights:

```
In [20]: paste
coefs = zip( x_train.columns, lr.coef_ )
for c, i in sorted( coefs, key = lambda _: _[1] ):
print "{:.1f} {}".format( i, c )
## -- End pasted text --
-1.4 NO2
-0.5 T2M
-0.3 O3
0.0 PM25
0.3 PM10
2421203389791.6 month_8
2421203389791.6 month_7
2421203389791.6 month_9
2421203389791.7 month_6
2421203389791.7 month_11
2421203389791.7 month_10
2421203389791.7 month_5
2421203389791.8 month_3
2421203389791.8 month_4
2421203389791.8 month_2
2421203389791.9 month_12
2421203389792.0 month_1
5373781166879.9 day_of_week_6
5373781166879.9 day_of_week_5
5373781166880.0 day_of_week_3
5373781166880.0 day_of_week_2
5373781166880.0 day_of_week_0
5373781166880.0 day_of_week_1
5373781166880.0 day_of_week_4
```

Looks like day of week and month are infinitely more important than the other features. Interestingly, the weights within each group are almost the same. What if we use month and day of week as the only features?

Ayayay! The code is available at GitHub.

Since we’re scaling, we can now throw in the year as a feature. In addition to scaling, we could subtract the first year that appears in the data, so that values start from zero. This second approach works slightly better (0.33496 vs 0.33588 on the public leaderboard), but neither improves the overall score. It’s a bit strange, because mortality IS falling. Perhaps just not in the test set.
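In pandas, the second variant is a one-liner (the dataframe here is a made-up stand-in for the competition data):

```python
import pandas as pd

d = pd.DataFrame({ 'date': pd.to_datetime([ '2007-01-01', '2010-06-15', '2012-12-31' ])})
d['year'] = d.date.dt.year - d.date.dt.year.min()  # years counted from the first one
```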

And of course, we tried using region information. There are nine regions, so one-hot encoding them is straightforward:

```
d = pd.get_dummies( d, columns = [ 'region' ])
```

We have 18k data points in the training set (12k after deleting examples with nulls), and only two or three dozen features total. This ratio is very good, especially for a linear model. Still, the leaderboard score with regions is about 0.35 - 0.36, meaning serious overfit. Blast!

As we might have mentioned, it seems to be the experience of other players too that whatever works in validation, doesn’t help with the leaderboard score (at least the public one).

There is a peculiarity in the data: for the first five regions, the train set contains data up to the end of 2012. For the last three, it ends with 2011. Region six is very special for some reason: it has data up to 2012-05-27.

```
In [16]: train.groupby( 'region' ).date.max().sort_index()
Out[16]:
region
E12000001 2012-12-31
E12000002 2012-12-31
E12000003 2012-12-31
E12000004 2012-12-31
E12000005 2012-12-31
E12000006 2012-05-27
E12000007 2011-12-31
E12000008 2011-12-31
E12000009 2011-12-31
```

Where the train set ends, the test set starts:

```
In [18]: test.groupby( 'region' ).date.min().sort_index()
Out[18]:
region
E12000001 2013-01-01
E12000002 2013-01-01
E12000003 2013-01-01
E12000004 2013-01-01
E12000005 2013-01-01
E12000006 2012-05-28
E12000007 2012-01-01
E12000008 2012-01-01
E12000009 2012-01-01
```

This could allow us to predict 2012 test values for the four “earlier” regions using mortality in the first five as features. We leave this as an exercise for you, dear reader.

*This article was sponsored by ECMWF.*

Candidates for tuning with Hyperband include all the SGD derivatives - meaning the whole of deep learning - and tree ensembles: gradient boosting and, perhaps to a lesser extent, random forest and extremely randomized trees. In other words, the most important supervised methods in use today.

The idea is to try a large number of random configurations:

while the Bayesian Methods perhaps consistently outperform random sampling, they do so only by a negligible amount. To quantify this idea, we compare to random run at twice the speed which beats the two Bayesian Optimization methods, i.e., running random search for twice as long yields superior results.

TPE in this chart is the *Tree of Parzen Estimators* from Hyperopt.

Trying all these configurations takes time. If you ever tuned parameters by hand, you know that for some sets of params, you can tell right from the start that they won’t be good. Still, popular tools take it to the bitter end and run for a prescribed number of iterations to get a score.

To solve this problem, Hyperband runs configs for just an iteration or two at first, to get a taste of how they perform. Then it takes the best performers and runs them longer. Indeed, that’s all Hyperband does: **run random configurations on a specific schedule of iterations per configuration, using earlier results to select candidates for longer runs**.

See the table below for an example of such a schedule (the default). It starts with 81 runs of one iteration each. Then the best 27 configurations get three iterations each. Then the best nine get nine, and so on. After all runs are complete, the algorithm returns the best configuration found so far, and you can run the whole thing all over again.

```
max_iter = 81        s=4        s=3        s=2        s=1        s=0
eta = 3            n_i r_i    n_i r_i    n_i r_i    n_i r_i    n_i r_i
B = 5*max_iter     ---------  ---------  ---------  ---------  ---------
                    81   1     27   3     9    9     6   27     5   81
                    27   3      9   9     3   27     2   81
                     9   9      3  27     1   81
                     3  27      1  81
                     1  81
```
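The table can be reproduced from the two parameters; here’s a sketch (the rounding follows the authors’ reference snippet, and `hyperband_schedule` is our name for it):

```python
def hyperband_schedule( max_iter = 81, eta = 3 ):
    import math
    # s_max: largest s with eta^s <= max_iter, computed with integer arithmetic
    s_max = 0
    while eta ** ( s_max + 1 ) <= max_iter:
        s_max += 1
    B = ( s_max + 1 ) * max_iter
    brackets = []
    for s in reversed( range( s_max + 1 )):
        # initial number of configurations in this bracket
        n = int( math.ceil( int( B / max_iter / ( s + 1 )) * eta ** s ))
        r = max_iter / eta ** s  # initial iterations per configuration
        rounds = []
        for i in range( s + 1 ):
            n_i = n // eta ** i        # configurations surviving to round i
            r_i = int( r * eta ** i )  # iterations each of them gets
            rounds.append(( n_i, r_i ))
        brackets.append( rounds )
    return brackets
```

With the defaults, this yields exactly the brackets shown above, from (81 configs, 1 iteration) down to (5 configs, 81 iterations).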

The schedule depends on two main parameters, *max_iter* and *eta*. *s* is derived from these two and dictates the number of rounds. As you can see, the authors arranged things so that there are no floats, only integers - reportedly based on masonic numerology (you know, *ordo ab chao*). They don’t mention it overtly and that’s the way it should be - some things must be kept confidential (then again, some don’t). But you surely wouldn’t perform the level-III underhand “pull-tap-pull” handshake in front of the cameras, would you? Reckless!

Should you choose other values for these params, you’ll get fractional numbers of iterations, but that’s not really a problem, because an iteration is not what you think it is.

The term iteration is meant to indicate a single unit of computation (e.g. an iteration could be .5 epochs over the dataset) and without loss of generality min_iter=1. Consequently, the length of an iteration should be chosen to be the minimum amount of computation where different hyperparameter configurations start to separate (or where it is clear that some settings diverge).

For tree-based methods, an iteration will be a number of trees, let’s say 5 or 10.

even if performance after a small number of iterations is very unrepresentative of the configuration’s absolute performance, its relative performance compared with many alternatives trained with the same number of iterations is roughly maintained.

By the way: in random forest, more trees serve to reduce variance, so this assumption may be slightly less valid than for other methods. With just a few trees, the difference between configurations might just reflect noise. With more trees, true performance reveals itself. This dynamic can possibly offset Hyperband’s hedging.

There are obvious counter-examples; for instance if learning-rate/step-size is a hyperparameter, smaller values will likely appear to perform worse for a small number of iterations but may outperform the pack after a large number of iterations.

Which leads us to the weakest point of the system: experiments that involve tuning a learning rate.

Hyperband is not a silver bullet.

In practice, when learning rate is a parameter, Hyperband finds configurations that converge quickly. But by the same token, it’s unlikely to find good “low learning rate with many iterations” combos. If you follow developments on Kaggle, you know that people often run XGBoost with precisely this setup to get the best results.

Even though the authors say they address this problem by hedging, in the default setup only a few configurations run for the maximum number of iterations (the last round: 5 x 81). All the others are pre-selected using fewer iterations. We think that a random search among five configurations is unlikely to hit the best stuff.

If you do not tune the learning rate, the Hyperband algorithm makes good sense. In that case, one could get rid of the last round, because hedging is not necessary: good configs will already have been pre-selected in the earlier rounds, with no need for blind random search. So if you cut out two of the five main loops, you save 40% of the time but only forfeit checking 13 configurations.

Even better, one could discard the last tier (1 x 81, 2 x 81, etc.) in each round, including the last round. This drastically reduces the time needed. We provide this option in our code.

The way it works, you give Hyperband two functions: one that returns a random configuration, and one that trains that configuration for a given number of iterations and returns a loss value. We call them `get_params()` and `try_params()`, respectively.

To define a search space and sample from it, we use Hyperopt - no sense in reinventing the wheel. Of course, if you don’t like it, you’re free to implement `get_params()` in any way you choose.

Here’s what a space for GradientBoostingClassifier might look like:

```
space = {
    'learning_rate': hp.uniform( 'lr', 0.01, 0.2 ),
    'subsample': hp.uniform( 'ss', 0.8, 1.0 ),
    'max_depth': hp.quniform( 'md', 2, 10, 1 ),
    'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None )),
    'min_samples_leaf': hp.quniform( 'msl', 1, 10, 1 ),
    'min_samples_split': hp.quniform( 'msp', 2, 20, 1 )
}
```

The hyperparams are straight from the manual. The distributions (`hp.uniform`, `hp.quniform`, `hp.choice`, etc.) are described in detail in the Hyperopt wiki. In short:

```
'learning_rate': hp.uniform( 'lr', 0.01, 0.2 )
```

Learning rate is to be sampled from a uniform distribution. The first argument, `lr`, is a label, which, frankly, we don’t care much about. Apparently Hyperopt needs them. After the label, we say that the learning rate can vary from 0.01 to 0.2.

```
'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None ))
```

Max. features is a categorical variable, and the possible values are ‘sqrt’, ‘log2’, or *None*. In this example *None* stands for “no maximum”, meaning all features.

```
'max_depth': hp.quniform( 'md', 2, 10, 1 )
```

Some variables, like the number of trees or the max. depth of a single tree, are integers, not floats. Therefore we use `hp.quniform` (*quantized uniform*) with the last parameter, *q*, set to 1. Should we need values like 4, 8, 12, 16, we’d use *q* = 4.

There are other handy distributions, specifically *log uniform*, but these three are the most important.
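For intuition, here’s roughly what these samplers do, per the definitions in the Hyperopt wiki (a simplified hand-rolled sketch, without labels):

```python
import random

rnd = random.Random( 0 )

def uniform( low, high ):
    return rnd.uniform( low, high )

def quniform( low, high, q ):
    # quantized uniform: round( uniform( low, high ) / q ) * q
    return round( rnd.uniform( low, high ) / q ) * q

def choice( options ):
    return rnd.choice( options )

params = {
    'learning_rate': uniform( 0.01, 0.2 ),
    'max_depth': quniform( 2, 10, 1 ),
    'max_features': choice(( 'sqrt', 'log2', None )),
}
```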

From our experiments on one dataset, using all features (`max_features = None`) emerged as the winner. It also looks like increasing *min_samples_leaf* and *min_samples_split*, the params that curb overfitting, might help. They interact with *max_depth*, which works in the other direction. The point is, you don’t need to discover all this by hand - let the computer do the work.

Here’s our parameter space for random forest and/or extremely randomized trees:

```
space = {
    'criterion': hp.choice( 'c', ( 'gini', 'entropy' )),
    'bootstrap': hp.choice( 'b', ( True, False )),
    'class_weight': hp.choice( 'cw', ( 'balanced', 'balanced_subsample', None )),
    'max_depth': hp.quniform( 'md', 2, 10, 1 ),
    'max_features': hp.choice( 'mf', ( 'sqrt', 'log2', None )),
    'min_samples_split': hp.quniform( 'msp', 2, 20, 1 ),
    'min_samples_leaf': hp.quniform( 'msl', 1, 10, 1 ),
}
```

That was the hard bit, the rest is really easy. The authors kindly provide source code, which is just a snippet of Python. We build on this piece to provide a fully functional implementation, which you can find at GitHub.

“Band” in the name stands for “bandit”.

Using the function is straightforward - you specify which columns you want encoded and get a dataframe with the original columns replaced with one-hot encodings.

```
df_with_dummies = pd.get_dummies( df, columns = cols_to_transform )
```

Naturally, there will be more columns in the new frame. They will have names corresponding to the original column and its values. For example, `car` will be replaced with `car_Audi`, `car_BMW`, `car_Mercedes`, etc.

What if the test set is small and some values are absent? Or it has new values not present in the training set, for example *Volkswagen*?

Two solutions come to mind. One is to `pd.concat(( train, test ))`, `get_dummies()`, and then split the set back. If the column sets in train and test differ, you can extract and concatenate just the categorical columns to encode.
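A quick sketch of the first solution, with made-up data:

```python
import pandas as pd

train = pd.DataFrame({ 'car': [ 'Audi', 'BMW' ]})
test = pd.DataFrame({ 'car': [ 'BMW', 'Volkswagen' ]})

combined = pd.concat(( train, test ), ignore_index = True )
combined = pd.get_dummies( combined, columns = [ 'car' ])

# split back; both frames now share the same one-hot columns
train_encoded = combined.iloc[:len( train )]
test_encoded = combined.iloc[len( train ):]
```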

Another way is to add the missing columns, filled with zeros, and delete any extra columns. For this to work, one first needs a list of the original `columns`. We pass the frame to fix and a list of columns to the following function:

```
def add_missing_dummy_columns( d, columns ):
    missing_cols = set( columns ) - set( d.columns )
    for c in missing_cols:
        d[c] = 0
```

Python passes objects by reference (more precisely, it passes references to objects), so the function operates on the original frame: it modifies `d` in place and returns nothing. This is a matter of taste and can easily be changed.

We also need to remove any extra columns and reorder the remaining ones to match the original setup:

```
def fix_columns( d, columns ):
    add_missing_dummy_columns( d, columns )
    # make sure we have all the columns we need
    assert( set( columns ) - set( d.columns ) == set())
    extra_cols = set( d.columns ) - set( columns )
    if extra_cols:
        print "extra columns:", extra_cols
    d = d[ columns ]
    return d
```

Now for some informal testing.

```
def fix_columns_test():
    n_cols = 4
    n_rows = 5
    columns = [ "col_{}".format( x ) for x in range( n_cols )]
    # create the "new" set of columns
    new_columns = columns[:]  # copy
    new_columns.pop()
    new_columns.append( 'col_new' )
    # create the "new" dataframe
    n = np.random.random(( n_rows, n_cols ))
    d = pd.DataFrame( n, columns = new_columns )
    print d
    print "\n", columns
    fixed_d = fix_columns( d.copy(), columns )
    print "\n", fixed_d
    assert( list( fixed_d.columns ) == columns )
```

By the way, if you happen to be using `get_dummies( ..., drop_first = True )`, you might want to think over the process described above to make sure everything works as expected.

Bad programmers worry about the code. Good programmers worry about data structures and their relationships. - Linus Torvalds (the creator of Linux)

Computer science is about dealing with complexity. The main means for this is abstraction, meaning building things from smaller blocks. The things themselves then become blocks for building even bigger things.

The basic and most important unit of abstraction is a function. A function is a black box that takes some inputs and returns some outputs. The whole idea is that you don’t need to know how things work inside, all you care about is the interface. Few people know how exactly a car works under the hood. Knowing how to drive it is enough.

The same goes for stuff in machine learning. From a user’s perspective, we would like to know how to format our data for training and then how to get predictions. It seems obvious, but unfortunately, practice shows otherwise. Time and time again we encounter implementations where data is just background for an algorithm.

We dedicated two articles to this very matter: Loading data in Torch (is a mess) and How to get predictions from Pylearn2. We like to think that ease of use, or lack of it, has something to do with long-term popularity, or lack of it, of a given software library.

As a more recent example, let’s look at Phased LSTM. The purpose of the model is to deal with asynchronous time series, where step size, or period between events, might differ. There are at least four implementations at Github, including the official one.

Naturally, since the point is to process irregularly sampled data, the first question would be how to represent such data. As an exercise, go figure this out.

Two of the implementations [1] [2] don’t bother with async inputs at all, they just use MNIST as an example of dealing with long time series - it’s the secondary usage scenario for the model.

The remaining two generate toy data on the fly, as is often the case with code accompanying a paper. In effect, there is no sample to look at. One needs to dig into the code, find the generator and run it to look at some data.

Tell me, Mr Anderson: what good is a program

if you’re unable to run it on your input?

Similarly, getting predictions from a model is often an afterthought. Some authors are content to compute a bunch of metrics and leave it at that. Why would anyone ever want to get actual predictions, right?

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowchart; it’ll be obvious. – Fred Brooks, The Mythical Man-Month

Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious. – Eric S. Raymond, The Cathedral and The Bazaar

**UPDATE**: For more, see Engineering is the bottleneck in (deep learning) research, by Denny Britz, and Software engineering vs machine learning concepts, by Paul Mineiro.

The popularity of chatbots is coming from a few sources, apparently. One, they exemplify the AI dream. Two, making a conversational bot is a fun technical challenge. And three: for many, maybe most, businesses, labour constitutes the biggest cost. Therefore corporations salivate at the prospect of exchanging humans, if only in the internet chat channel, for computers.

Image credit: @0x7000

(Un)fortunately, we’re still very, very far - like 25 years - from strong AI, which in practical terms means that chatbots are crap. You won’t have a meaningful conversation with a computer. All it can do in a customer support role is provide some information and maybe sell you something. A chatbot is like an inflatable doll: cheap and available, but not much else.

Source: @sdw

Now let’s step back from this bland perspective and look at two very successful interfaces that involve typing: a Unix shell and google.com. Notice that in a way they are the opposite of chatbots: they don’t pretend to have any intelligence, and while they use something based on natural language (because what else is there?), they keep human input maximally succinct and to the point. For example, you don’t say,

```
$ List all files in this directory
```

You say

```
$ ls
```

Similarly, you are unlikely to have the following dialogue, at least when typing:

```
$ Hey Google, what's the temperature outside?
And where are you located, sir?
$ I'm in Singapore
It's 79 degrees Fahrenheit.
$ In Celsius, dang nabbit!
Sorry sir. It's 26 degrees Celsius.
```

Instead, one types “temperature Singapore” and when presented with a weather dashboard, clicks C. By the way, neither Singapore nor our location uses the Fahrenheit scale (only the USA and a few banana republics nearby do), so why does Google show us F?

From this little thought experiment we would conclude that the ideal chatbot is, in essence, artificial intelligence using natural language in written form as a communication channel. The critical AI component is just out of reach:

Human level AI is always just 25 years away. Source: PDF

That leaves only wordiness, and few like to type more than needed (except maybe Java programmers and the guy who designed infix operators in R). Speak, yes, but not type. That makes chatbots similar to communism: promising in theory, dismal in practice. If you know any examples to the contrary, we’d be delighted to get to know them.

Here’s Mat Kelcey proving us wrong:

Man, these Australians…


There’s an interesting story about how Hadley invented all those things. It goes like this. An angel - some say it was a daemon, but don’t believe them, it was an angel - visited our hero in a dream and said: “**I will give you some ideas that will make you rich and famous - well, rich intellectually and famous in the R community - but there’s a catch. For reasons I won’t disclose, for a mortal like you wouldn’t really understand, you must make the piping operator as bad as you possibly can, but without rendering it outright ridiculous. No more than three chars, and remember - as ugly and hard to type as you can.**”

Hadley agreed and woke up. Being the smart guy that he is, in the morning he constructed an appropriate bundle of characters. Reportedly, the thought process unfolded as follows:

Okay, three characters. Let’s invert that old `<-` and elongate it: `-->`. Meh, waaay too pretty, and you type it all with one hand.

I know… `~~>` is good. `~>~` even better. One needs to press the tilde key twice, then delete one, go to `>`, repeat, all with Shift pressed. Sweet. But dang, too pretty. I need ugly.

Think, man, think! Let’s see. `#>#`. Yeah. Nah, that looks half-reasonable. Wait… wait… yes… `%>%`. That’s it!

Image credit: Dexter’s Laboratory

And the rest is history:

```
carriers_db2 %>% summarise(delay = mean(arr_delay)) %>% collect()
```

Image credit: Natalie Cooper

But seriously, the man in question says that’s all because infix operators in R must have the form `%something%`. By the way, Hadley has at least two books online: Advanced R and R for Data Science.

We know of three modules for piping in Pandas: pandas-ply, dplython and dfply. All three use reasonable piping operators.

**pandas-ply**, from Coursera, is the simplest of them and closest to the Pandas spirit. It uses a normal dot for chaining and just adds a few methods to the DataFrame. Here’s their motivating example, adapted from the *dplyr* intro:

```
grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

# instead:
(flights
    .groupby(['year', 'month', 'day'])
    .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
    .ply_where(X.arr > 30, X.dep > 30))
```

Less typing and no need for intermediate artifacts. Notice how you refer to the transformed dataframe inside the pipeline by X.

**dplython** is closer to *dplyr*. The module provides verbs (functions) similar to the R counterpart, but the pipeline operator is a handsome *>>*, and there’s this nice *diamonds* dataset:

```
(diamonds >>
    sample_n(10) >>
    arrange(X.carat) >>
    select(X.carat, X.cut, X.depth, X.price))

(diamonds >>
    mutate(carat_bin=X.carat.round()) >>
    group_by(X.cut, X.carat_bin) >>
    summarize(avg_price=X.price.mean()))
```

What’s with the outer parens? Is this Lisp or something?

Let us mention that in R, all functions are pipable. In Python, you need to make them pipable. dplython has a special decorator for it, *@DelayFunction*.

Finally, there is **dfply**, inspired by dplython, but with even more functions. It appears less mature than the previous two - *pip install* won’t work here. The example shows some means of deleting columns from a frame:

```
diamonds >> drop_endswith('e','y','z') >> head(2)
```

What these modules provide is mostly syntactic sugar, and using them is a matter of personal taste. For example, while the pandas-ply flights example above is convincing, is one of these lines better than the others?

```
diamonds.ply_where(X.carat > 4).ply_select('carat', 'cut', 'depth', 'price')
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)
diamonds[diamonds.carat > 4]['carat', 'cut', 'depth', 'price']
```

We’d like automatic X in pandas:

```
diamonds[X.carat > 4][X.carat, X.cut, X.depth, X.price]
```
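For the record, plain pandas does ship one pipe-like facility: `DataFrame.pipe`, which threads a frame through ordinary functions. A minimal sketch (the helper functions and toy data here are ours, not from any of the modules above):

```python
import pandas as pd

def big_carats(df, threshold):
    # keep rows whose carat exceeds the threshold
    return df[df.carat > threshold]

def pick(df, *cols):
    # keep only the named columns
    return df[list(cols)]

diamonds = pd.DataFrame({'carat': [0.2, 4.5, 5.0],
                         'cut': ['Fair', 'Ideal', 'Good'],
                         'price': [300, 40000, 50000]})

result = diamonds.pipe(big_carats, 4).pipe(pick, 'carat', 'price')
```

No automatic X here, but it does chain arbitrary functions without intermediate variables.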

Apparently it was Stefan Milton Bache who invented the piping operator, in the magrittr package. He’s Danish; they have the most complicated and difficult language in Europe, except for Hungarian. The Danish don’t mind such trivial inconveniences as `%>%`. By the way, there’s more: `%T>%`, `%$%`, `%<>%`.

Neural networks are conceptually simple, and that’s their beauty. A bunch of homogeneous, uniform units, arranged in layers, with weighted connections between them, and that’s all. At least in theory. Practice turned out to be a bit different. Instead of feature engineering, we now have architecture engineering, as described by Stephen Merity:

The romanticized description of deep learning usually promises that the days of hand crafted feature engineering are gone - that the models are advanced enough to work this out themselves. Like most advertising, this is simultaneously true and misleading.

Whilst deep learning has simplified feature engineering in many cases, it certainly hasn’t removed it. As feature engineering has decreased, the architectures of the machine learning models themselves have become increasingly more complex. Most of the time, these model architectures are as specific to a given task as feature engineering used to be.

To clarify, this is still an important step. Architecture engineering is more general than feature engineering and provides many new opportunities. Having said that, however, we shouldn’t be oblivious to the fact that where we are is still far from where we intended to be.

Not quite as bad as doings of architecture astronauts, but not too good either.

An example of architecture specific to a given task

How to explain those architectures? Naturally, with a diagram. A diagram will make it all crystal clear.

Let’s first inspect the two most popular types of networks these days, CNN and LSTM. You’ve already seen a convnet diagram, so turning to the iconic LSTM:

It’s easy, just take a closer look:

As they say, in mathematics you don’t understand things, you just get used to them.

Fortunately, there are good explanations, for example Understanding LSTM Networks and Written Memories: Understanding, Deriving and Extending the LSTM.

LSTM still too complex? Let’s try a simplified version, GRU (Gated Recurrent Unit). Trivial, really.

Especially this one, called *minimal GRU*.

Various modifications of LSTM are now common. Here’s one, called deep bidirectional LSTM:

DB-LSTM, PDF

The rest are pretty self-explanatory, too. Let’s start with a combination of CNN and LSTM, since you have both under your belt now:

Convolutional Residual Memory Network, 1606.05262

Dynamic NTM, 1607.00036

Evolvable Neural Turing Machines, PDF

Unsupervised Domain Adaptation By Backpropagation, 1409.7495

Deeply Recursive CNN For Image Super-Resolution, 1511.04491

Recurrent Model Of Visual Attention, 1406.6247

This diagram of multilayer perceptron with synthetic gradients scores high on clarity:

MLP with synthetic gradients, 1608.05343

Every day brings more. Here’s a fresh one, again from Google:

Google’s Neural Machine Translation System, 1609.08144

Drawings from the Neural Network ZOO are pleasantly simple, but, unfortunately, serve mostly as eye candy. For example:

ESM, ESN and ELM

These look like not-fully-connected perceptrons, but are supposed to represent a *Liquid State Machine*, an *Echo State Network*, and an *Extreme Learning Machine*.

How does LSM differ from ESN? That’s easy, it has a green neuron with triangles. But how does ESN differ from ELM? Both have blue neurons.

Seriously, while similar, ESN is a recurrent network and ELM is not. And this kind of thing should probably be visible in an architecture diagram.

You haven’t seen anything till you’ve seen A Neural Compiler:

The input of the compiler is a PASCAL Program.

The compiler produces a neural network that computes what is specified by the PASCAL program.

The compiler generates an intermediate code called cellular code.

Weird, huh?

swibe: In traditional convolution layers, the convolution is tied up with cross-channel pooling: for each output channel, a convolution is applied to each input channel and the results are summed together.

This leads to the unfortunate situation where the network may often be repeatedly applying similar filters to each input channel. Storing these filters wastes memory, and applying them repeatedly wastes computation.

It’s possible to instead split the computation into a convolution stage, where multiple filters are applied to each input channel, and a cross-channel pooling stage where the output channels each use whichever intermediate results are of use to them. It allows the filters to be shared across multiple output channels. This reduces their number substantially, allowing for more efficient networks, or larger networks with a similar computational cost.
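A back-of-the-envelope comparison of parameter counts makes the saving concrete; a rough sketch assuming k x k filters and ignoring bias terms:

```python
# Parameter counts: standard convolution vs the split
# (depthwise filters + 1x1 cross-channel pooling) described above.
def standard_conv_params(c_in, c_out, k):
    # one k x k filter per (input channel, output channel) pair
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k, depth_multiplier=1):
    depthwise = c_in * depth_multiplier * k * k    # per-channel spatial filters
    pointwise = c_in * depth_multiplier * c_out    # 1x1 cross-channel mixing
    return depthwise + pointwise

print(standard_conv_params(256, 256, 3))   # 589824
print(separable_conv_params(256, 256, 3))  # 67840 - roughly 9x fewer
```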

benanne: This is not a particularly novel idea, so I get the feeling that some references are missing. It’s been available in TensorFlow as `tf.nn.separable_conv2d()` for a while, and in this presentation Vincent Vanhoucke [slides, video] also discusses it (slide 26 and onwards).

The results seem to be pretty solid though, and the interaction with residual connections probably makes it more practical. It’s a nice way to further increase depth and nonlinearity while keeping the computational cost and risk of overfitting at reasonable levels.

Now, you can use Cubert to make these beauties. However, if you’re more of a do-it-yourself type, here’s a HOWTO.

Let’s say you’ve performed dimensionality reduction with a method of your choosing and have some data points looking like this:

```
cid,x,y,z
1.0,0.131364496515,-0.590685372085,-1.00062387318
-1.0,-1.90206919581,-0.0518527188196,-1.01665336703
1.0,2.29749236265,-0.982830132008,0.0511009011955
```

First goes the class label and then the three dimensions. The software we use, data-projector, needs a JSON file:

```
{"points": [
{"y": "-79.0866574", "x": "-3.15971493", "z": "-98.5084333", "cid": "1.0"},
{"y": "-50.3503514", "x": "-100.0", "z": "-100.0", "cid": "0.0"},
{"y": "-100.0", "x": "100.0", "z": "-0.643983041", "cid": "1.0"}
]}
```

The dimensions in the cube go from -100 to 100, so we rescale the data accordingly:

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

d = pd.read_csv( input_file )
assert set( d.columns ) == set([ 'cid', 'x', 'y', 'z' ])
scaler = MinMaxScaler( feature_range=( -100, 100 ))
d[[ 'x', 'y', 'z' ]] = scaler.fit_transform( d[[ 'x', 'y', 'z' ]])
```

If our labels are in order (starting from 0), we’re ready to save to JSON:

```
import json

d_json = { 'points': json.loads( d.astype( str ).to_json( orient = 'records' )) }
json.dump( d_json, open( output_file, 'w' ))
```

Why the acrobatics in the first line? We could save directly with:

```
d.astype( str ).to_json( output_file, orient = 'records' )
```

The reason is that we need to wrap the data in a dictionary with one key called ‘points’. Therefore, we:

- convert the data frame to a JSON string
- parse that string back into a Python object
- dump the wrapped object to a file

The complete code is available at GitHub.

Now move `data.json` to the `data-projector` directory and open `index.html` with your browser. That is, if your browser happens to be Firefox.

If you’re using Chrome, you’ll need to access `index.html` through HTTP, because apparently Chrome policy doesn’t allow loading data from external files when opening a file from a local disk.

Yasser Souri gives one solution:

- open a console in the `data-projector` directory
- type `python -m SimpleHTTPServer 80` (assuming Python 2.x; with Python 3, it’s `python -m http.server 80`)
- open *http://localhost* in your browser

The problem with training examples being different from test examples is that validation won’t be any good for comparing models. That’s because validation examples originate in the training set.

We can see this effect when using Numerai data, which comes from financial time series. We first tried logistic regression and got the following validation scores:

```
LR
AUC: 52.67%, accuracy: 52.74%
MinMaxScaler + LR
AUC: 53.52%, accuracy: 52.48%
```

What about a more expressive model, like logistic regression with polynomial features (that is, feature interactions)? They’re easy to create with *scikit-learn*:

```
from sklearn.pipeline import make_pipeline
poly_scaled_lr = make_pipeline( PolynomialFeatures(), MinMaxScaler(), LogisticRegression())
```

This pipeline looked much better in validation than plain logistic regression, and also better than *MinMaxScaler + LR* combo:

```
PolynomialFeatures + MinMaxScaler + LR
AUC: 53.62%, accuracy: 53.04%
```

So that’s a no-brainer, right? Here are the actual leaderboard scores (from the earlier round of the tournament, using AUC):

```
# AUC 0.51706 / LR
# AUC 0.52781 / MinMaxScaler + LR
# AUC 0.51784 / PolynomialFeatures + MinMaxScaler + LR
```

As it turns out, poly features do about as well as plain LR. Scaler + LR seems to be the best option.

We couldn’t tell that from validation, so it appears that we can’t trust it for selecting models and their parameters.

We’d like to have a validation set representative of the Numerai test set. To that end, we’ll take care to select examples for the validation set which are the most similar to the test set.

Specifically, we’ll run the distinguishing classifier in cross-validation mode, to get predictions for all training examples. Then we’ll see which training examples are misclassified as test and use them for validation.

To be more precise, we’ll choose a number of misclassified examples that the model was most certain about. It means that they look like test examples but in reality are training examples.

Numerai data after PCA. Training set in red, test in turquoise. Quite regular, shaped like a sphere…

Or maybe like a cube? Anyway, sets look difficult to separate.

**UPDATE**: Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.

First, let’s try training a classifier to tell train from test, just like we did with the Santander data. Mechanics are the same, but instead of 0.5, we get 0.87 AUC, meaning that the model is able to classify the examples pretty well (at least in terms of AUC, which measures ordering/ranking).

By the way, there are only about 50 training examples that random forest misclassifies as test examples (assigning probability greater than 0.5). We work with what we have and mostly care about the order, though.

Cross-validation provides predictions for all the training points. Now we’d like to sort the training points by their estimated probability of being test examples.

```
i = predictions.argsort()
train['p'] = predictions
train_sorted = train.iloc[i]
```

We did the ascending sort, so for validation we take a desired number of examples from the end:

```
val_size = 5000
train = train_sorted.iloc[:-val_size]
val = train_sorted.iloc[-val_size:]
```

The current evaluation metric for the competition is log loss. We’re not using a scaler with LR anymore because the data is already scaled. We only scale after creating poly features.

```
LR
AUC: 52.54%, accuracy: 51.96%, log loss: 69.22%
PolynomialFeatures + MinMaxScaler + LR
AUC: 52.57%, accuracy: 51.76%, log loss: 69.58%
```

Let us note that differences between models in validation are pretty slim. Even so, the order is correct - we would choose the right model from the validation scores. Here’s the summary of results achieved for the two models:

Validation:

```
# 0.6922 / LR
# 0.6958 / PolynomialFeatures + MinMaxScaler + LR
```

Public leaderboard:

```
# 0.6910 / LR
# 0.6923 / PolynomialFeatures + MinMaxScaler + LR
```

And the private leaderboard at the end of the May round:

```
# 0.6916 / LR
# 0.6954 / PolynomialFeatures + MinMaxScaler + LR
```

As you can see, our improved validation scores translate closely into the private leaderboard scores.

- Train a classifier to identify whether data comes from the train or test set.
- Sort the training data by its probability of being in the test set.
- Select the training data most similar to the test data as your validation set.
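The recipe above can be sketched end to end with scikit-learn; everything in this sketch (synthetic data, forest size, validation size) is illustrative, not the actual Numerai setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for train and test; the test set is shifted on purpose
rng = np.random.RandomState(0)
x_train = rng.normal(0.0, 1.0, (1000, 5))
x_test = rng.normal(0.5, 1.0, (500, 5))

x = np.vstack([x_train, x_test])
y = np.concatenate([np.zeros(len(x_train)),
                    np.ones(len(x_test))]).astype(int)  # 1 = "test"

# Cross-validated probability of each example being a test example
clf = RandomForestClassifier(n_estimators=100, random_state=0)
p = cross_val_predict(clf, x, y, cv=5, method='predict_proba')[:, 1]

# The most test-like training examples become the validation set
p_train = p[:len(x_train)]
order = p_train.argsort()
val_idx = order[-200:]
train_idx = order[:-200]
```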

*(By Jim Fleming)*

Still, it’s a drag to model upper and lower case separately. It adds to dimensionality, and perhaps more importantly, a network gets no clue that ‘a’ and ‘A’ actually represent pretty much the same thing.

The simplest solution is to discard uppercase and just use lowercase. We propose a more elegant way to deal with the two problems mentioned above: **inserting special markers before each uppercase letter**.

```
Hello World -> ^hello ^world
```

The resulting text is still quite readable. Of course you need to make sure there are no carets in your input to start with, but this is a minor matter: one could use any character as a marker, or invent one. Remember that a char is just a sparse vector: we can make it longer by one element, and that abstract element can be our marker.

Gents witnessing the emergence of R33, the very first char-RNN, back in the day

Here’s how to convert mixed-case text:

```
s = 'Hello World'
re.sub( '([A-Z])', '^\\1', s ).lower()
```

What we do is insert a caret before each uppercase letter and then turn the whole string to lowercase (`\1` is a *backreference* to a subgroup marked by parens in the first pattern; we need to quote the backslash, hence `\\1`). An alternative is to perform both operations inside `sub()`, using a function to modify the match and return a replacement:

```
re.sub( '([A-Z])', lambda match: "^" + match.group( 1 ).lower(), s )
```

Should we need to convert stuff back, we’d use a similar construct:

```
s = '^hello ^world'
re.sub( '\^(.)', lambda match: match.group( 1 ).upper(), s )
```

A caret means “start of the line” in a regular expression, so we need to quote it with a backslash.
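The two conversions above should compose to the identity; a quick round-trip check:

```python
import re

def encode(s):
    # '^' marker before each uppercase letter, then lowercase everything
    return re.sub('([A-Z])', '^\\1', s).lower()

def decode(s):
    # uppercase whatever follows a '^' marker and drop the marker
    return re.sub('\\^(.)', lambda m: m.group(1).upper(), s)

print(encode('Hello World'))          # ^hello ^world
print(decode(encode('Hello World')))  # Hello World
```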

Does it work? It does. The network is especially quick to learn the `. ^` combo, representing the end of a sentence and an uppercase letter at the beginning of the next one.

The trick described above is meant for text. People have used char-RNNs for modelling other stuff. It is conceivable to use a similar gimmick for source code, or music, for example to insert bar markers - that might help a network learn the rhythm.

In part one, we inspect the ideal case: training and testing examples coming from the same distribution, so that the validation error gives a good estimate of the test error and the classifier generalizes well to unseen test examples.

In such a situation, if we attempted to train a classifier to distinguish training examples from test examples, it would perform no better than random. This would correspond to a ROC AUC of 0.5.

Does it happen in reality? It does, for example in the Santander Customer Satisfaction competition at Kaggle.

We start by setting the labels according to the task. It’s as easy as:

```
train = pd.read_csv( 'data/train.csv' )
test = pd.read_csv( 'data/test.csv' )
train['TARGET'] = 1
test['TARGET'] = 0
```

Then we concatenate both frames and shuffle the examples:

```
data = pd.concat(( train, test ))
data = data.iloc[ np.random.permutation(len( data )) ]
data.reset_index( drop = True, inplace = True )
x = data.drop( [ 'TARGET', 'ID' ], axis = 1 )
y = data.TARGET
```

Finally we create a new train/test split:

```
train_examples = 100000
x_train = x[:train_examples]
x_test = x[train_examples:]
y_train = y[:train_examples]
y_test = y[train_examples:]
```

Come to think of it, there’s a shorter way (no need to shuffle examples beforehand, too):

```
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
x_train, x_test, y_train, y_test = train_test_split( x, y, train_size = train_examples )
```

Now we’re ready to train and evaluate. Here are the scores:

```
# logistic regression / AUC: 49.82%
# random forest, 10 trees / AUC: 50.05%
# random forest, 100 trees / AUC: 49.95%
```

Train and test are like two peas in a pod, like Tweedledum and Tweedledee - indistinguishable to our models.

Below is a 3D interactive visualization of the combined train and test sets, in red and turquoise. They very much overlap. Click the image to view the interactive version (might take a while to load, the data file is ~8MB).

**UPDATE**: The hosting provider which shall remain unnamed has taken down the account with visualizations. We plan to re-create them on Cubert. In the meantime, you can do so yourself.

Santander training set after PCA

**UPDATE**: It’s a-live! Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.

Let’s see if validation scores translate into leaderboard scores, then. We train and validate logistic regression and a random forest. LR gets **58.30**% AUC, RF **75.32**% (subject to randomness).

On the private leaderboard LR scores **61.47**% and RF **74.37**%. These numbers correspond pretty well to the validation results.

The code is available at GitHub.

In part two, due in two weeks, we’ll see what we can do when train and test differ.


Here’s a hint. Think about the following:

- Have you ever seen a photo of Zygmunt?
- Have you ever met Zygmunt at a conference or heard him speak there?
- Have you ever read any papers by Zygmunt?

That’s right. There’s no Zygmunt the Polish economist ever willing to relocate to San Francisco.

And the “we” that we always use in the posts is not majestic plural. **We** are three Chinese PhD students: Ah, Hai, and Wang*.

To keep the style consistent, Ah does most of the writing and all of the editing. He likes to stay on top of state of the art, runs @fastml_extra and can be a bit shy sometimes. Ah has a (no-longer) secret crush on Anima Anandkumar.

Hai doesn’t care about state of the art or any fancy methods. He just wants to get stuff done. He also doesn’t mind things like data wrangling, feature engineering, and plotting. Hai likes to help people, writes most of the code and is the principal author of phraug.

Wang has a confrontational style and strong opinions about everything. He hates big data, Hadoop, Spark, Weka, Java, Facebook, Google, TensorFlow, AI, distributed systems, you name it. One entity Wang tolerates is Kaggle.

Unfortunately, he has the strongest command of spoken English and almost no accent, so if you are one of the few who have spoken to Zygmunt over Skype, sorry - it was Wang.

In case you exchanged any written messages with Zygmunt, it might have been any one of us typing. We usually route our traffic through Poland to keep up appearances.

Why the deception? China, aspiring to the role of global superpower, tends to evoke strong feelings. At the same time, many people in the west unfortunately disregard Chinese scientists and their work. Therefore we adopted a neutral guise so as not to distract readers from the content.

To preserve the cover, we tried to abstain from mentioning Chinese creations. We think we mostly succeeded - we have only written about Extreme Learning Machines, Liblinear, and stuff by DMLC guys: XGBoost and MXNet.

Keeping our mouths shut that way was hard, because the Chinese have a lot to offer, both in traditional machine learning and in deep learning - even when counting only those in continental China. There’s been some very advanced research going on, for example on topic models and dimensionality reduction [1] [2] [3] [4] [5]. Ignore it at your own peril.

Let your plans be dark and impenetrable as night, and when you move, fall like a thunderbolt.

–Sun Tzu, The Art of War

*Authors contribute unequally.


*While we have some grasp on the matter, we’re not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.*

In essence, Bayesian means probabilistic. The specific term exists because there are two approaches to probability. Bayesians think of it as a measure of belief, so that probability is subjective and refers to the future.

Frequentists have a different view: they use probability to refer to past events - in this way it’s objective and doesn’t depend on one’s beliefs. The name comes from the method - for example: we tossed a coin 100 times, it came up heads 53 times, so the frequency/probability of heads is 0.53.

For a thorough investigation of this topic and more, refer to Jake VanderPlas’ Frequentism and Bayesianism series of articles.

As Bayesians, we start with a belief, called a prior. Then we obtain some data and use it to update our belief. The outcome is called a posterior. Should we obtain even more data, the old posterior becomes a new prior and the cycle repeats.

This process employs the **Bayes rule**:

```
P( A | B ) = P( B | A ) * P( A ) / P( B )
```

`P( A | B )`, read as “probability of A given B”, indicates a conditional probability: how likely A is if B happens.
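A toy application of the rule, with made-up numbers: a rare disease and an imperfect test.

```python
# All numbers below are invented for illustration.
p_disease = 0.01            # P( A ): prior probability of the disease
p_pos_given_disease = 0.90  # P( B | A ): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P( B ): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P( A | B ): probability of the disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # 0.154 - still quite unlikely
```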

In Bayesian machine learning we use the Bayes rule to infer model parameters (theta) from data (D):

```
P( theta | D ) = P( D | theta ) * P( theta ) / P( D )
```

All components of this are probability distributions.

`P( D )` is something we generally cannot compute, but since it’s just a normalizing constant, it doesn’t matter that much. When comparing models, we’re mainly interested in expressions containing theta, because `P( D )` stays the same for each model.

`P( theta )` is the prior, our belief about what the model parameters might be. Most often our opinion in this matter is rather vague, and if we have enough data, we simply don’t care. Inference should converge to probable theta as long as it’s not zero in the prior. One specifies a prior in terms of a parametrized distribution - see Where priors come from.

`P( D | theta )` is the likelihood of data given model parameters. The formula for likelihood is model-specific. People often use likelihood to evaluate models: a model that gives higher likelihood to real data is better.

Finally, `P( theta | D )`, the posterior, is what we’re after: a probability distribution over model parameters obtained from prior beliefs and data.

When one uses likelihood to get point estimates of model parameters, it’s called maximum-likelihood estimation, or MLE. If one also takes the prior into account, then it’s maximum a posteriori estimation (MAP). MLE and MAP are the same if the prior is uniform.
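A tiny worked example of the difference, estimating a coin’s heads probability with a Beta prior (the standard conjugate choice here; the counts are made up):

```python
# 7 heads, 3 tails observed
heads, tails = 7, 3
n = heads + tails

# MLE: just the observed frequency
mle = heads / n                                 # 0.7

# MAP with a Beta(a, b) prior: mode of the posterior Beta(heads+a, tails+b)
a, b = 2, 2                                     # mild prior pull toward 0.5
map_est = (heads + a - 1) / (n + a + b - 2)     # 8/12 ~ 0.667

# With a uniform prior Beta(1, 1), MAP reduces to the MLE
uniform_map = (heads + 1 - 1) / (n + 1 + 1 - 2) # 0.7
```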

Note that choosing a model can be seen as separate from choosing model (hyper)parameters. In practice, though, they are usually performed together, by validation.

Inference refers to how you learn parameters of your model. A model is separate from how you train it, especially in the Bayesian world.

Consider deep learning: you can train a network using Adam, RMSProp or a number of other optimizers. However, they tend to be rather similar to each other, all being variants of Stochastic Gradient Descent. In contrast, Bayesian methods of inference differ from each other more profoundly.

The two most important methods are Monte Carlo sampling and variational inference. Sampling is the gold standard, but slow. The excerpt from The Master Algorithm has more on MCMC.

Variational inference is a method designed explicitly to trade some accuracy for speed. Its drawback is that it’s model-specific, but there’s light at the end of the tunnel - see the section on software below and Variational Inference: A Review for Statisticians.

In the spectrum of Bayesian methods, there are two main flavours. Let’s call the first *statistical modelling* and the second *probabilistic machine learning*. The latter contains the so-called nonparametric approaches.

Modelling happens when data is scarce and precious and hard to obtain, for example in social sciences and other settings where it is difficult to conduct a large-scale controlled experiment. Imagine a statistician meticulously constructing and tweaking a model using what little data he has. In this setting you spare no effort to make the best use of available input.

Also, with small data it is important to quantify uncertainty, and that’s precisely what the Bayesian approach is good at.

Bayesian methods - specifically MCMC - are usually computationally costly. This again goes hand-in-hand with small data.

To get a taste, consider examples for the Data Analysis Using Regression and Multilevel/Hierarchical Models book. That’s a whole book on linear models. They start with a bang: a linear model with no predictors, then go through a number of linear models with one predictor, two predictors, six predictors, up to eleven.

This labor-intensive mode goes against a current trend in machine learning to use data for a computer to learn automatically from it.

Let’s try replacing “Bayesian” with “probabilistic”. From this perspective, it doesn’t differ as much from other methods. As far as classification goes, most classifiers are able to output probabilistic predictions. Even SVMs, which are sort of the antithesis of the Bayesian approach.

By the way, these probabilities are only statements of belief from a classifier. Whether they correspond to real probabilities is another matter entirely, and it’s called calibration.

Latent Dirichlet Allocation is a method that one throws data at and allows it to sort things out (as opposed to manual modelling). It’s similar to matrix factorization models, especially non-negative MF. You start with a matrix where rows are documents, columns are words and each element is a count of a given word in a given document. LDA “factorizes” this matrix of size *n x d* into two matrices, documents/topics (*n x k*) and topics/words (*k x d*).

The difference from factorization is that you can’t multiply those two matrices to get the original, but since the appropriate rows/columns sum to one, you can “generate” a document. To get the first word, one samples a topic, then a word from this topic (the second matrix). Repeat this for a number of words you want. Notice that this is a bag-of-words representation, not a proper sequence of words.
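The generative procedure just described fits in a few lines of numpy; the two matrices here are toy stand-ins for the factors LDA would actually learn:

```python
import numpy as np

rng = np.random.default_rng(0)

doc_topics = np.array([0.7, 0.3])              # one row of the n x k matrix
topic_words = np.array([[0.5, 0.4, 0.1, 0.0],  # k x d matrix, rows sum to one
                        [0.0, 0.1, 0.4, 0.5]])

words = []
for _ in range(10):                             # "generate" a 10-word document
    topic = rng.choice(2, p=doc_topics)         # sample a topic for this word
    word = rng.choice(4, p=topic_words[topic])  # then a word from that topic
    words.append(word)
```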

The above is an example of a **generative** model, meaning that one can sample, or generate examples, from it. Compare with classifiers, which usually model `P( y | x )` to discriminate between classes based on *x*. A generative model is concerned with the joint distribution of *y* and *x*, `P( y, x )`. It’s more difficult to estimate that distribution, but it allows sampling, and of course one can get `P( y | x )` from `P( y, x )`.
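Getting `P( y | x )` from `P( y, x )` is just normalization by the marginal `P( x )`. A toy NumPy example with a made-up joint table:

```python
import numpy as np

# made-up joint distribution P(y, x): rows are y in {0, 1}, columns are x in {0, 1, 2}
joint = np.array([
    [0.10, 0.20, 0.10],   # y = 0
    [0.30, 0.10, 0.20],   # y = 1
])
assert np.isclose(joint.sum(), 1.0)  # a proper joint distribution sums to one

p_x = joint.sum(axis=0)       # marginal P(x), summing the joint over y
p_y_given_x = joint / p_x     # conditional P(y | x), column by column

print(p_y_given_x[:, 0])      # P(y | x=0) = [0.25, 0.75]
```

Each column of the conditional sums to one, as a distribution over *y* should.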

While there’s no exact definition, “nonparametric” means that the number of parameters in a model can grow as more data become available. This is similar to Support Vector Machines, for example, where the algorithm chooses support vectors from the training points. Nonparametrics include the Hierarchical Dirichlet Process version of LDA, where the number of topics chooses itself automatically, and Gaussian Processes.

Gaussian processes are somewhat similar to Support Vector Machines - both use kernels and have similar scalability (which has been vastly improved throughout the years by using approximations). A natural formulation for GP is regression, with classification as an afterthought. For SVM it’s the other way around.

Another difference is that GP are probabilistic from the ground up (providing error bars), while SVM are not. You can observe this in regression. Most “normal” methods only provide point estimates. Bayesian counterparts, like Gaussian processes, also output uncertainty estimates.
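Those error bars fall straight out of the GP posterior equations. A bare-bones NumPy sketch of GP regression with an RBF kernel on toy data (a real implementation would use Cholesky solves rather than an explicit inverse):

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# a handful of noisy training points (made up)
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_train = np.sin(x_train)
noise = 1e-4

# standard GP posterior at a grid of test points
x_test = np.linspace(-5, 5, 50)
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf(x_test, x_train)
K_ss = rbf(x_test, x_test)

K_inv = np.linalg.inv(K)
mean = K_s @ K_inv @ y_train                   # posterior mean (the point estimate)
cov = K_ss - K_s @ K_inv @ K_s.T               # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # the "error bars"
```

The uncertainty collapses near the training points and grows as you move away from them, which is exactly the behaviour the plots above illustrate.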

Credit: Yarin Gal’s Heteroscedastic dropout uncertainty and What my deep model doesn’t know.

Unfortunately, it’s not the end of the story. Even a sophisticated method like GP normally operates on an assumption of homoscedasticity, that is, uniform noise levels. In reality, noise might differ across input space (be heteroscedastic) - see the image below.

A relatively popular application of Gaussian Processes is hyperparameter optimization for machine learning algorithms. The data is small, both in dimensionality - usually only a few parameters to tweak, and in the number of examples. Each example represents one run of the target algorithm, which might take hours or days. Therefore we’d like to get to the good stuff with as few examples as possible.

Most of the research on GP seems to happen in Europe. Researchers in England have done some interesting work on making GP easier to use, culminating in the automated statistician, a project led by Zoubin Ghahramani.

Watch the first 10 minutes of this video for an accessible intro to Gaussian Processes.

The most conspicuous piece of Bayesian software these days is probably Stan. Stan is a probabilistic programming language, meaning that it allows you to specify and train whatever Bayesian models you want. It runs in Python, R and other languages. Stan has a modern sampler called NUTS:

Most of the computation [in Stan] is done using Hamiltonian Monte Carlo. HMC requires some tuning, so Matt Hoffman up and wrote a new algorithm, Nuts (the “No-U-Turn Sampler”) which optimizes HMC adaptively. In many settings, Nuts is actually more computationally efficient than the optimal static HMC!

One especially interesting thing about Stan is that it has automatic variational inference:

Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). The user only provides a Bayesian model and a dataset; nothing else.

This technique paves the way to applying small-data-style modelling to at least medium-sized data.

In Python, the most popular package is PyMC. It is not as advanced or polished (the developers seem to be playing catch-up with Stan), but still good. PyMC has NUTS and ADVI - here’s a notebook with a minibatch ADVI example. The software uses Theano as a backend, so it’s faster than pure Python.

**UPDATE**: Edward is a probabilistic programming library built on top of TensorFlow. It features some deep models and appears to be faster than the competition, at least when using a GPU.

Infer.NET is Microsoft’s library for probabilistic programming. It’s mainly available from languages like C# and F#, but apparently can also be called from .NET’s IronPython. Infer.NET uses expectation propagation by default.

Besides those, there’s a myriad of packages implementing various flavours of Bayesian computing, from other probabilistic programming languages to specialized LDA implementations. One interesting example is CrossCat:

CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.

and BayesDB/Bayeslite from the same people.

To solidify your understanding, you might go through Radford Neal’s tutorial on Bayesian Methods for Machine Learning. It corresponds 1:1 to the subject of this post.

We found Kruschke’s Doing Bayesian Data Analysis, known as the puppy book, most readable. The author goes to great lengths to explain all the ins and outs of modelling.

Statistical Rethinking appears to be of a similar kind, but newer. It has examples in R + Stan. The author, Richard McElreath, published a series of lectures on YouTube.

In terms of machine learning, both books only go as far as linear models. Likewise, Cam Davidson-Pylon’s Probabilistic Programming & Bayesian Methods for Hackers covers the *Bayesian* part, but not the *machine learning* part.

The same goes for Alex Etz’ series of articles on understanding Bayes.

For those mathematically inclined, Machine Learning: a Probabilistic Perspective by Kevin Murphy might be a good book to check out. You like hardcore? No problemo, Bishop’s Pattern Recognition and Machine Learning got you covered. One recent Reddit thread briefly discusses these two.

Bayesian Reasoning and Machine Learning by David Barber is also popular, and freely available online, as is Gaussian Processes for Machine Learning, the classic book on the matter.

As far as we know, there’s no MOOC on Bayesian machine learning, but *mathematicalmonk* explains machine learning from the Bayesian perspective.

Stan has an extensive manual, PyMC a tutorial and quite a few examples.

]]>

Many data science competitions suffer from the test set being markedly different from the training set (a violation of the “identically distributed” assumption). It is then difficult to make a representative validation set. We propose a method for selecting the training examples most similar to the test examples and using them as a validation set. The core of this idea is training a probabilistic classifier to distinguish train / test examples.

So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - we know it took us a while. This article is an introduction we wish we had back then.

For us, there are two major challenges facing deep learning: computational demands and cognitive demands. By cognitive demands we mean that stuff is getting complicated. We take a look at the situation and how people go about dealing with computational demands.

Conformal prediction is related to classifier calibration. The basic premise is that you get a guaranteed maximum error rate (false negatives, to be exact), and you set that rate as low or as high as you’re willing to tolerate. The catch is, you may get multiple classes assigned to an example: in binary classification, a point can be labelled **both** positive and negative.

The Genentech competition made available rather large data files containing the complete medical histories of a few million patients. The biggest three were roughly 50GB on disk and 500 million examples each. How to handle such files, specifically how to run GROUP BY operations? We considered two choices: a relational database or Pandas. We went with Pandas. It didn’t quite work even when using a machine with enough RAM, but we found a way.

Now that you know the options, please cast your vote for what you would like to read about next.

**UPDATE**: Voters clearly seem to prefer an article about Bayesian machine learning, so it’s coming. Posts on the other subjects may appear too, possibly in shorter-than-usual form.

In 2005, Caruana et al. made an empirical comparison of supervised learning algorithms [video]. They included random forests and boosted decision trees and concluded that

With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second.

Let’s note two things here. First, they mention **calibrated** boosted trees, meaning that for probabilistic classification trees needed calibration to be the best. Second, it’s unclear what boosting method the authors used.

In the follow-up study concerning supervised learning in high dimensions the results are similar:

Although there is substantial variability in performance across problems and metrics in our experiments, we can discern several interesting results. First, the results confirm the experiments in (Caruana & Niculescu-Mizil, 2006) where boosted decision trees perform exceptionally well when dimensionality is low. In this study boosted trees are the method of choice for up to about 4000 dimensions. Above that, random forests have the best overall performance.

Ten years later Fernandez-Delgado et al. revisited the topic with the paper titled Do we need hundreds of classifiers to solve real world classification problems? Notably, there were no results for gradient-boosted trees, so we asked the author about it. Here’s the answer, reprinted with permission:

That comment has been issued by other researcher (David Herrington), our response was that we tried GBM (gradient boosting machine) in R directly and via caret, but we achieved errors for problems with more than two[-class] data sets. However, in response to him, we developed further experiments with GBM (using only two-class data sets) achieving good results, even better than random forest but only for two-class data sets. This is the email with the results. I hope they can be useful for you. Best regards!

Dear Prof. Herrington:

I apologize for the delay in the answer to your last email. I have achieved results using gbm, but I was so delayed because I found errors with data sets more than two classes: gbm with caret only worked with two-class data sets, it gives an error with multi-class data sets, the same error as in http://stackoverflow.com/questions/15585501/usage-of-caret-with-gbm-method-for-multiclass-classification.

I tried to run gbm directly in R as tells the previous link, but I also found errors with multi-class data sets. I have been trying to find a program that runs, but I did not get it. I will keep trying, but by now I send to you the results with two classes, comparing both GBM and Random Forests (in caret, i.e., rf_t in the paper). The GBM worked without only for 51 data sets (most of them with two classes, although there are 55 data sets with two classes, so that GBM gave errors in 4 two-class data sets), and the average accuracies are:

rf = 82.30% (+/-15.3), gbm = 83.17% (+/-12.5)

so that GBM is better than rf_t. In the paper, the best classifier for two-class data sets was avNNet_t, with 83.0% accuracy, so that GBM is better on these 51 data sets. Attached I send to you the results of RF and GBM, and the plot with the two accuracies (ordered decreasingly) for the 51 data sets.

The detailed results are available on GitHub.

From the chart it would seem that RF and GBM are very much on par. Our feeling is that GBM offers a bigger edge. For example, in Kaggle competitions XGBoost replaced random forests as a method of choice (where applicable).

If we were to guess, the edge didn’t show in the paper because GBT need way more tuning than random forests. It’s quite time-consuming to tune an algorithm to the max for each of the many datasets.

With a random forest, in contrast, the first parameter to select is the number of trees. Easy: the more, the better. That’s because the multitude of trees serves to reduce variance. Each tree fits, or overfits, a part of the training set, and in the end their errors cancel out, at least partially. Random forests do overfit, just compare the error on train and validation sets.

Other parameters you may want to look at are those controlling how big a tree can grow. As mentioned above, averaging predictions from each tree counteracts overfitting, so usually one wants biggish trees.

One such parameter is *min. samples per leaf*. In *scikit-learn*’s RF, its value is one by default. Sometimes you can try increasing this value a little bit to get smaller trees and less overfitting. This CoverType benchmark overdoes it, going from 1 to 13 at once. Try 2 or 3 first.

Finally, there’s *max. features to consider*. Once upon a time, we tried tuning that param, to no avail. We suspect that it may have a better effect when dealing with sparse data - it would make sense to try increasing it then.
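In scikit-learn, the three knobs discussed above map to `n_estimators`, `min_samples_leaf` and `max_features`. A quick sketch on a synthetic dataset; the parameter values below are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# a synthetic stand-in for a real dataset
x, y = make_classification(n_samples=2000, n_features=20, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,       # the more, the better - limited only by patience
    min_samples_leaf=2,     # slightly bigger than the default of 1
    max_features='sqrt',    # the usual default; worth revisiting for sparse data
    random_state=0,
)
rf.fit(x_train, y_train)
p = rf.predict_proba(x_val)[:, 1]
print('validation AUC: {:.2%}'.format(roc_auc_score(y_val, p)))
```

With a real dataset, bumping `n_estimators` is the safest first move; the other two are worth a small grid at most.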

That’s about it for random forests. With gradient-boosted trees there are so many parameters that it’s a subject for a separate article.

]]>

For one thing, the dataset is very clean and tidy. As we mentioned in the article on the Rossmann competition, most Kaggle offerings have their quirks. Often we got the impression that the organizers were making the competition unnecessarily convoluted - apparently against their own interests. It’s rather hard to find a contest where you could just apply whatever methods you fancy, without much data cleaning and feature engineering. In this tournament, you can do exactly that.

The task is binary classification. The dataset is low dimensional (14 continuous variables, one categorical, with cardinality of 23) and has a lot of examples, but not too many - 55k. All you need to do is create a validation set (an indicator column is supplied for that), take care of the categorical variable, and get cracking.

The metric for the competition is AUC. Normally, random predictions result in an AUC of 0.5. The current leader scores roughly 0.55, which suggests that stocks are a hard problem indeed, as our previous investigation indicated.

Well-known, mainstream approaches concentrate on predicting asset volatility instead of prices. Predicting volatility allows one to value options using the famous Black-Scholes formula. No doubt there are other techniques, but for obvious reasons people aren’t very forthcoming with publishing them. One insider look confirms that algorithmic learning works and people make tons of money - until the models stop working.
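For reference, the Black-Scholes price of a European call takes volatility as its key input, which is why a volatility forecast is all you need. A minimal sketch using only the standard library:

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(s, k, t, r, sigma):
    """European call price: s = spot, k = strike, t = years to expiry,
    r = risk-free rate, sigma = annualized volatility."""
    d1 = (log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    return s * norm_cdf(d1) - k * exp(-r * t) * norm_cdf(d2)

# a textbook example: at-the-money call, one year out, 20% vol, 5% rate
price = black_scholes_call(s=100, k=100, t=1.0, r=0.05, sigma=0.2)
print(round(price, 2))  # about 10.45
```

Everything here is observable except sigma - hence the focus on forecasting it.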

Numerai’s solution to this problem is to crowdsource the construction of models. All they want is predictions.

We have invented regularization techniques that transform the problem of capital allocation into a binary classification problem. (…) Recently, breakthrough developments in encryption have made it possible to conceal information but also preserve structure. (…) We’re buying, regularizing and encrypting all of the financial data in the world and giving it away for free.

Well, you sure can download the dataset without registering. Still no idea what it represents, but it doesn’t stop you from placing on the leaderboard with a good black-box model. From Richard Craib, the Numerai founder:

I worked at a big fund. They wanted to kill me when I proposed running a Kaggle competition. Then I started learning about encryption and quit to start my own Kaggle inspired hedge fund.

Getting back to the comparisons with Kaggle, there are a few more differences about the logistics. More people get the money - the whole top 10. Also, **the payouts will be recurring**. This is good news: if you find yourself near the top of the leaderboard and stay there, the rewards will keep flowing. We hear that they might increase if the Numerai hedge fund goes up.

Let’s dive in, then. We have prepared a few Python scripts that will get you started with validation and prediction.

**UPDATE**: Logistic regression code for March 2016 data.

As we mentioned, each example has a validation flag, because even though the points look independent, the underlying data has a time dimension. The split is set up so that you don’t use data “from the future” in training.

```
d = pd.read_csv( 'numerai_training_data.csv' )
# indices of validation examples
iv = d.validation == 1
val = d[iv].copy()
train = d[~iv].copy()
# no need for the column anymore
train.drop( 'validation', axis = 1 , inplace = True )
val.drop( 'validation', axis = 1 , inplace = True )
```

In our experiments we found that cross-validation produces scores very similar to the predefined split, so you don’t have to stick with it.

The next thing to do is encoding the categorical variable. Let’s take a look.

```
In [5]: data.groupby( 'c1' )['c1'].count()
Out[5]:
c1
c1_1 1356
c1_10 3358
c1_11 2339
c1_12 367
c1_13 74
c1_14 5130
c1_15 3180
c1_16 2335
c1_17 1501
c1_18 1552
c1_19 1465
c1_20 2944
c1_21 1671
c1_22 1858
c1_23 2373
c1_24 2236
c1_3 10088
c1_4 2180
c1_5 2640
c1_6 1112
c1_7 1111
c1_8 3182
c1_9 986
Name: c1, dtype: int64
```

We replace the original feature with dummy (indicator) columns:

```
train_dummies = pd.get_dummies( train.c1 )
train_num = pd.concat(( train.drop( 'c1', axis = 1 ), train_dummies ), axis = 1 )
val_dummies = pd.get_dummies( val.c1 )
val_num = pd.concat(( val.drop( 'c1', axis = 1 ), val_dummies ), axis = 1 )
```

Of course it doesn’t hurt to check if the set of unique values is the same in the train and validation sets:

```
assert( set( train.c1.unique()) == set( val.c1.unique()))
```

If it weren’t, we could create dummies before splitting the sets.
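That fallback - creating the dummies before splitting - is easy to sketch with a toy frame (the column names below are made up for illustration):

```python
import pandas as pd

# toy frames where the categories don't fully overlap: 'c' appears only in val
train = pd.DataFrame({'c1': ['a', 'b', 'a'], 'f1': [1, 2, 3]})
val = pd.DataFrame({'c1': ['a', 'c'], 'f1': [4, 5]})

# encode on the combined frame, so both parts see every category
combined = pd.concat((train, val), ignore_index=True)
combined = pd.get_dummies(combined, columns=['c1'])

# split back - both parts now share the columns c1_a, c1_b, c1_c
train_num = combined.iloc[:len(train)]
val_num = combined.iloc[len(train):]
```

Encoding each part separately would give the two frames different columns, which trips up any downstream model.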

And we’re done with pre-processing. At least when using trees, which don’t care about column means and variances. For other supervised methods, especially neural networks, we’d probably want to standardize - see the appendix below.

Training a random forest with 1000 trees results in validation AUC of roughly 52%. On the leaderboard, it becomes 51.8%.

Now you can proceed to stack them models like crazy.

**UPDATE**: This tournament also has a nasty quirk - validation scores didn’t reflect the leaderboard score. It resulted in a major re-shuffle in the final standings. Interestingly, seven of the top-10 contenders stayed in the top 10, while the rest tumbled down.

Before and after.

Scikit-learn provides a variety of scalers, a row normalizer and other nifty gimmicks. We’re going to try them out with logistic regression. To avoid writing the same thing many times, we first define a function that takes data as input, trains, predicts, evaluates, and returns scores:

```
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import roc_auc_score as AUC, accuracy_score as accuracy

def train_and_evaluate( y_train, x_train, y_val, x_val ):
    lr = LR()
    lr.fit( x_train, y_train )
    p = lr.predict_proba( x_val )
    p_bin = lr.predict( x_val )
    acc = accuracy( y_val, p_bin )
    auc = AUC( y_val, p[:,1] )
    return ( auc, acc )
```

Then it’s time for…

We create a wrapper around `train_and_evaluate()` that transforms X’s before proceeding. This time we use global data to avoid passing it as arguments each time:

```
def transform_train_and_evaluate( transformer ):
    global x_train, x_val, y_train, y_val
    x_train_new = transformer.fit_transform( x_train )
    x_val_new = transformer.transform( x_val )
    return train_and_evaluate( y_train, x_train_new, y_val, x_val_new )
```

Now let’s iterate over transformers:

```
transformers = [ MaxAbsScaler(), MinMaxScaler(), RobustScaler(), StandardScaler(),
    Normalizer( norm = 'l1' ), Normalizer( norm = 'l2' ), Normalizer( norm = 'max' ),
    PolynomialFeatures() ]

for transformer in transformers:
    print transformer
    auc, acc = transform_train_and_evaluate( transformer )
    print "AUC: {:.2%}, accuracy: {:.2%} \n".format( auc, acc )
```

We can also combine transformers using Pipeline, for example to create quadratic features and only then scale:

```
poly_scaled = Pipeline([( 'poly', PolynomialFeatures()), ( 'scaler', MinMaxScaler())])
transformers.append( poly_scaled )
```

The output:

```
No transformation
AUC: 52.67%, accuracy: 52.74%
MaxAbsScaler(copy=True)
AUC: 53.52%, accuracy: 52.46%
MinMaxScaler(copy=True, feature_range=(0, 1))
AUC: 53.52%, accuracy: 52.48%
RobustScaler(copy=True, with_centering=True, with_scaling=True)
AUC: 53.52%, accuracy: 52.45%
StandardScaler(copy=True, with_mean=True, with_std=True)
AUC: 53.52%, accuracy: 52.42%
Normalizer(copy=True, norm='l1')
AUC: 53.16%, accuracy: 53.19%
Normalizer(copy=True, norm='l2')
AUC: 52.92%, accuracy: 53.20%
Normalizer(copy=True, norm='max')
AUC: 53.02%, accuracy: 52.66%
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
AUC: 53.25%, accuracy: 52.61%
Pipeline(steps=[
('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])
AUC: 53.62%, accuracy: 53.04%
```

It appears that all the pre-processing methods boost AUC, at least in validation. The code is available on GitHub.

]]>

The first thing to realize about TensorFlow is that it’s a low-level library, meaning you’ll be multiplying matrices and vectors. Tensors, if you will. In this respect, it’s very much like Theano.

For those preferring a higher level of abstraction, Keras now works with either Theano or TensorFlow as a backend, so you can compare them directly. Is TF any better than Theano? The annoying compilation step is not as pronounced. Other than that, it’s mostly a matter of taste.

**UPDATE**: Google released Pretty Tensor, a higher-level wrapper for TF, and skflow, a simplified interface mimicking scikit-learn.

Now for the elephant in the room… Soumith’s benchmarks suggest that TensorFlow is rather slow.

And this:

@kastnerkyle any idea how to make it run faster or optimise grads? for pure cpu, the javascript version of mdn runs faster than tensorflow!

— hardmaru (@hardmaru) November 26, 2015

MDN in the tweet stands for mixture density networks.

And Alex Smola’s numbers posted by Xavier Amatriain:

As you can see, at the moment TensorFlow doesn’t look too good compared to popular alternatives.

What’s really interesting about the library is its purported ability to use multiple machines. Unfortunately, this part hasn’t been released yet. No wonder, distributed is hard. Jeff Dean explains that it’s too intertwined with Google’s internal infrastructure, and says *distributed support is one of the top features they’re prioritizing*.

If you want software that is faster and works in a distributed setting *now*, check out MXNet. As a bonus, it has interfaces for other languages, including R and Julia. The people behind MXNet have experience with neural networks and distributed backends, and have written XGBoost, probably the most popular tool among Kagglers.

To sum up, Google released a solid - but hardly outstanding - library that captured a disproportionately large piece of mindshare. Good for them, fresh hires won’t have to learn a new API.

For more, see the Indico machine learning team’s take on TensorFlow, and maybe TensorFlow Disappoints.

]]>

Pandas provides functionality similar to R’s data frame. Data frames are containers for tabular data, including both numbers and strings. Unfortunately, the library is pretty complicated and unintuitive. It’s the kind of software you constantly find yourself referring to Stack Overflow about. Therefore it would be nice to have a mental model of how it works and what to expect of it.

We discovered this model listening to a talk by Wes McKinney, the creator of Pandas. He said that the library started as a **replacement for doing analytics in SQL**.

SQL is a language used for moving data in and out of relational databases such as MySQL, Oracle, PostgreSQL, SQLite etc. It has a strong theoretical base called *relational algebra* and is pretty easy to read and write once you get the hang of it.

These days you don’t hear that much about SQL, because proven, reliable and mature technology is not news material. SQL is like Unix: it’s a backbone in its domain.

Want to try SQL? One of the easiest ways is to use SQLite. Contrary to other databases, it doesn’t require installation and uses flat files as storage. You’d suspect it’s much slower or more primitive, but no. In fact, it’s one of the finest pieces of software we have seen. It’s small, fast and reliable, and has an extensive test suite. The users include Airbus, Apple, Bosch and a number of other well-known companies.

Pandas can read and write to and from databases, so we can create a database in a few lines of code:

```
import pandas as pd
import sqlite3
train_file = 'data/train.csv'
db_file = 'data/sales.sqlite'
train = pd.read_csv( train_file )
conn = sqlite3.connect( db_file )
train.to_sql( 'train', conn, index = False, if_exists = 'replace' )
```

After that, you can use Pandas or one of the available managers to connect to the database and execute queries.

Pandas’ SQL heritage shows, once you know what to look for. If you’re familiar with SQL, it makes using Pandas easier. We’ll show some operations that could be done with either. For demonstration, we’ll use the data from the Rossmann Store Sales competition.

More often than not, Kaggle competitions have quirks. For example: dirty data, a strange evaluation metric, a test set markedly different from the train set, difficulty in constructing a validation set, label leakage, few data points, and so on.

One might consider some of these interesting or spicy; others just spoil the fun. The fact is, rarely do you come across a contest where you can just go and apply some supervised methods off the bat, without wrestling with superfluous problems. The Rossmann competition is a rare exception.

The point is to predict sales in about a thousand stores across Germany. There are roughly a million training points and 40k test points. The training set spans January 2013 through June 2015, and the test set the next three months in 2015.

The data is clean and nice, prepared with German solidity. There are some non-standard things to consider, but they are mostly of the good kind.

We’re dealing with a time dimension. This is a very common problem. How to address it? One can pretend the data is static and use feature engineering to account for time. As features we could have, for example, binary indicators for the day of week (already provided), the month, perhaps the year. Another possibility is to come up with a model inherently capable of dealing with time series.

The evaluation metric is RMSE, but computed on relative error. For example, when you predict zero for non-zero sales, the error value is one. When you predict twice the actual sales, the error will also be one. In effect, this metric doesn’t favour big stores over small ones, as raw RMSE would.
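The metric (RMSE over relative errors, called RMSPE in the competition) is easy to sketch in NumPy; the function below reproduces the two examples above, skipping rows with zero actual sales:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error over non-zero actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0                 # zero-sales rows don't count
    rel_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(rel_err ** 2))

print(rmspe([5000], [0]))       # predicting zero for non-zero sales: error 1.0
print(rmspe([5000], [10000]))   # predicting twice the actual: also 1.0
```

Since errors are relative, a 10% miss costs the same at a big store as at a small one.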

Shop ID is a categorical variable, and relatively high-dimensional. This might make using our first choice, tree ensembles, difficult. Solution? Employ a method able to deal both with high dimensionality and feature interactions (because we need them). Factorization machines are one such method. Another option: transform the data to a lower-dimensional representation.

Besides the usual prizes for the 1st, 2nd and 3rd place, there’s an additional prize for the team whose methodology Rossmann will choose to implement. We consider this a very welcome improvement, as it addresses some issues with Kaggle we raised a while ago.

Let’s look at a histogram of sales, excluding zeros:

`train.loc[train.Sales > 0, 'Sales'].hist( bins = 30 )`

Should we need normality, we can apply the log-transform:

`np.log( train.loc[train.Sales > 0, 'Sales'] ).hist( bins = 20 )`

The benchmark for the competition predicts sales for any given store as a median of sales from all stores on the same day of the week. This means we GROUP BY the day of the week:

```
medians_by_day = train.groupby( ['DayOfWeek'] )['Sales'].median()
```

The result:

```
In [1]: medians_by_day
Out[1]:
DayOfWeek
1 7310
2 6463
3 6133
4 6020
5 6434
6 5410
7 0
Name: Sales, dtype: int64
```

Here’s the same thing in SQL:

```
SELECT DayOfWeek, MEDIAN( Sales ) FROM train GROUP BY DayOfWeek
```

We prefer the median over the mean because of the metric. Unfortunately, the MEDIAN function seems to be missing from the popular databases, so we have to stick with the mean for the purpose of this demonstration:

```
SELECT DayOfWeek, AVG( Sales ) FROM train GROUP BY DayOfWeek
```

By convention, we use uppercase for SQL keywords, even though they are case-insensitive. We’re selecting from a table called *train*, so we’d also have *test*, just like train and test files. Since both contain data with the same structure, in real life they would probably be in one table, but let’s play with two for the sake of analogy.

The organizers decided to leave in the days when stores were closed, probably to keep the dates continuous. The sales on these days were zero. Currently we’re not taking this into account, but we probably should:

```
medians_by_day_open = train.groupby( ['DayOfWeek', 'Open'] )['Sales'].median()
In [3]: medians_by_day_open
Out[3]:
DayOfWeek Open
1 0 0
1 7539
2 0 0
1 6502
3 0 0
1 6210
4 0 0
1 6246
5 0 0
1 6580
6 0 0
1 5425
7 0 0
1 6876
Name: Sales, dtype: int64
```

By the way, the returned *medians_by_day_open* is a series:

```
In [4]: type( medians_by_day_open )
Out[4]: pandas.core.series.Series
```

We’d get the median for Tuesday/Open the following way:

```
In [5]: medians_by_day_open[2][1]
Out[5]: 6502
```

Note how these numbers are bigger than the medians by day alone. We get a better estimate by excluding “closed” days, and the easiest way to do this is to remove them from the train set:

```
train = train.loc[train.Sales > 0]
```

And in SQL:

```
DELETE FROM train WHERE Sales = 0
```

Just for completeness, there seem to be a few days where a store was open but no sales occurred:

```
In [6]: len( train[( train.Open ) & ( train.Sales == 0 )] )
Out[6]: 54
```

The obvious way to improve on the benchmark is to group not only by day of week, but also by a store. Running the query in Pandas:

```
query = '''SELECT DayOfWeek, Store, AVG( Sales ) AS AvgSales FROM train
GROUP BY DayOfWeek, Store'''
res = pd.read_sql( query, conn )
res.head()
Out[2]:
DayOfWeek Store AvgSales
0 1 1 4946.119403
1 1 2 5790.522388
2 1 3 7965.029851
3 1 4 10365.686567
4 1 5 5834.880597
```

Note that we can give an alias to *AVG( Sales )*, which is not a very good name for a column, by SELECTing it AS *AvgSales*.

Even better, we can include other fields, for example *Promo*. We have enough data to get medians for every possible combination of these three variables.

```
medians = train.groupby( ['DayOfWeek', 'Store', 'Promo'] )['Sales'].median()
medians = medians.reset_index()
```

`reset_index()` converts *medians* from a series to a data frame.

In SQL, we would store the computed means for further use in a so-called view. A view is like a virtual table: it shows results from a SELECT query.

```
CREATE VIEW means AS
SELECT DayOfWeek, Store, Promo, AVG( Sales ) AS AvgSales FROM train
GROUP BY DayOfWeek, Store, Promo
```

Data for machine learning quite often comes from relational databases. The tools typically expect 2D data: a table or a matrix. On the other hand, in a database you may have it spread over multiple tables, because it’s more natural to store information that way. For example, there may be one table called *sales* which contains store ID, date and sales on that day. Another table, *stores* would contain store data.

If you’d like to use stores info for predicting sales, you need to merge those two tables so that every sale row contains info about the relevant store. Note that the same piece of store data will repeat across many rows. That’s why it’s in the separate table in the first place.

For now, we have the medians/means and want to produce predictions for the test set. Accordingly, there are two data frames, or tables: test and medians/means. For each row in test, we’d like to pull an appropriate median. This could be done with a loop, or with an `apply()` function, but there’s a better way.

```
test2 = pd.merge( test, medians, on = ['DayOfWeek', 'Store', 'Promo'], how = 'left' )
```

This will take the Sales column from medians and put it in test so that Sales match DayOfWeek, Store and Promo in test. The operation is known as a JOIN.

```
SELECT test.*, means.AvgSales AS Sales FROM test LEFT JOIN means ON (
test.DayOfWeek = means.DayOfWeek
AND test.Store = means.Store
AND test.Promo = means.Promo )
```

You see that SQL can be a bit verbose.

LEFT JOIN means that we treat the left table (test) as primary: if there’s a row in the left table but no corresponding row to join in the right table, we still keep the row in the results. Sales will be NULL in that case. Contrast this with the default INNER JOIN, where we would discard the row. We want to keep all rows in test, that’s why we use a left join. The resulting frame should have as many rows as the original test:

```
assert( len( test2 ) == len( test ))
```

All that is left is saving the predictions to a file.

```
test2[[ 'Id', 'Sales' ]].to_csv( output_file, index = False )
```

The benchmark scores about 0.19, our solution 0.14, the leaders at the time of writing 0.10.

The script is available on GitHub and at Kaggle Scripts. Did you know that you can run your scripts on Kaggle servers? There are some catches, however. One, the script gets released under Apache license and you can’t delete it. Two, it can run for at most 20 minutes.

If you want more on the subject matter, Greg Reda has an article (and a talk) on translating SQL to Pandas, as well as a general Pandas tutorial. This Pandas + SQLite tutorial digs deeper.

]]>