FastML

Machine learning made easy

Loading data in Torch (is a mess)

Torch 7 is a GPU-accelerated deep learning framework. It had been rather obscure until the recent publicity following its adoption by Facebook and DeepMind. This entirely anecdotal article describes our experiences trying to load some data in Torch. In short: it’s impossible, unless you’re dealing with images.

UPDATE: PyTorch, a Python version of Torch made available in January 2017, seems to solve many problems mentioned in this article.

We had great expectations about Torch. It seemed like a dream come true, especially with endorsement by DeepMind and LeCun’s group at Facebook (the latter includes some of the creators of the framework). The reality turned out to be a little hairier.


Image credit: Adventure Time with Finn and Jake

The Torch tutorial deals with images and random tensors. These only get you so far; we’d like to load some simple, numeric data. Generally it comes in two flavours: dense (few zeros) and sparse (mostly zeros). For dense data, CSV is probably the most popular format; for sparse data, it’s Libsvm.
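To make the two flavours concrete, here is a small Python sketch (standard library only; the helper names `to_csv_line` and `to_libsvm_line` are ours, made up for illustration) that writes the same row both ways. It assumes the usual Libsvm convention of a label followed by 1-based `index:value` pairs for the non-zero entries only.

```python
def to_csv_line(label, row):
    """Dense: write the label and every value, zeros included."""
    return ",".join(str(v) for v in [label] + list(row))


def to_libsvm_line(label, row):
    """Sparse: write the label, then index:value pairs for non-zeros only.

    Indexes are 1-based, as is conventional in Libsvm files.
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


row = [0, 0, 1, 0, 0, 2.5, 0]
print(to_csv_line(-1, row))     # -1,0,0,1,0,0,2.5,0
print(to_libsvm_line(-1, row))  # -1 3:1 6:2.5
```

The more zeros a row has, the shorter the second form gets compared to the first.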

What we’d like to achieve is to get data into Torch’s native tensors, suitable for use with various Torch functions. Tensors are analogous to Numpy arrays; they generalize matrices to three or more dimensions, and also cover the 1D vector case.
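As a rough illustration of the idea (plain Python lists, not Torch or Numpy objects; `shape` is a helper made up for this sketch), the number of nesting levels is the number of dimensions:

```python
def shape(nested):
    """Return the dimensions of a regularly nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        # Descend into the first element; stop on an empty list
        nested = nested[0] if nested else None
    return tuple(dims)


vector = [1, 2, 3]                                              # 1D
matrix = [[1, 2, 3], [4, 5, 6]]                                 # 2D
cube = [[[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]   # 3D

print(shape(vector))  # (3,)
print(shape(matrix))  # (2, 3)
print(shape(cube))    # (3, 2, 2)
```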

Libsvm

We picked the adult dataset to play with. It’s sparse and readily available in Libsvm format. The torch-svm package provides an interface to the Libsvm library and has facilities for loading files in the Libsvm format.

th> d = svm.ascread( 'train.txt' )
Reading train.txt   
# of positive samples = 7841
# of negative samples = 24720
# of total    samples = 32561
# of max dimensions   = 123
Min # of dims = 11
Max # of dims = 14

But what you get is not a tensor, or two tensors, like you would get from scikit-learn:

from sklearn.datasets import load_svmlight_file
x, y = load_svmlight_file( 'train.txt' )

What you get is this:

th> d

<14 seconds of printout, 32k entries>

th> d[1]
{
  1 : -1
  2 : 
    {
      1 : IntTensor - size: 14
      2 : FloatTensor - size: 14
    }
}

It’s a table of rows. The first entry of each row is the label; the second is a table holding two tensors: the indexes of the non-zero elements and their values.

th> d[1][2][1]
  3
 11
 14
 19
 39
 42
 55
 64
 67
 73
 75
 76
 80
 83
[torch.IntTensor of dimension 14]

th> d[1][2][2]
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
[torch.FloatTensor of dimension 14]

It probably wouldn’t be that difficult to write some code to convert this to tensors; the point is that nobody seems to have done it before, at least publicly.
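For what it’s worth, the conversion logic itself is short. Below is a sketch in Python rather than Lua, with plain lists standing in for the table and the two tensors. The name `to_dense` is made up, and the number of columns (123 for adult, going by the printout above) has to be supplied by the caller. It hasn’t been run against torch-svm; it only mirrors the structure shown in the printout.

```python
def to_dense(rows, n_cols):
    """Expand rows of the form [label, [indexes, values]] into dense lists.

    Indexes are assumed to be 1-based, as in the printout above.
    Returns (x, y): a list of dense rows and a list of labels.
    """
    x, y = [], []
    for label, (indexes, values) in rows:
        dense = [0.0] * n_cols
        for i, v in zip(indexes, values):
            dense[i - 1] = v  # shift the 1-based index to Python's 0-based
        x.append(dense)
        y.append(label)
    return x, y


# A toy stand-in for the structure returned by svm.ascread
d = [
    [-1, [[3, 5], [1.0, 1.0]]],
    [1, [[1, 2, 6], [1.0, 0.5, 2.0]]],
]
x, y = to_dense(d, n_cols=6)
print(x)  # [[0.0, 0.0, 1.0, 0.0, 1.0, 0.0], [1.0, 0.5, 0.0, 0.0, 0.0, 2.0]]
print(y)  # [-1, 1]
```

In Torch, the same double loop would presumably fill a preallocated tensor of zeros instead of a list of lists.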

By the way, apparently there are no sparse tensors in Torch (only some budding attempts), so there is no computational bonus for using sparse data.

CSV

CSV is the bread and butter of data. All popular environments (Python, R, Matlab, …) have a single function that loads CSV into a table - for example loadtxt() in Numpy or csvread() in Matlab. While one could argue (as people do in the comments) that you can’t put mixed-type data in a tensor, this doesn’t stop Numpy and Matlab from reading CSV into tensor-like, single-type structures. Objection overruled.

We’ll be using this sample file:

col1,col2,col3
4,5,6
7,8,9

Torch has a package named csvigo. And sure, there’s a function for loading and a function for saving. What you get, though, is not a tensor. It’s a table of columns:

th> f = csvigo.load{ path='test.csv' }
<csv>   parsing file: test.csv  
<csv>   tidying up entries  
<csv>   returning tidy table    

th> f
{
  col2 : 
    {
      1 : "5"
      2 : "8"
    }
  col3 : 
    {
      1 : "6"
      2 : "9"
    }
  col1 : 
    {
      1 : "4"
      2 : "7"
    }
}

The documentation is a bit hard to get at, but it turns out that you can load the data into a table of rows using the raw mode:

th> f = csvigo.load{ path = 'test.csv', mode = 'raw' } 
<csv>   parsing file: test.csv  
<csv>   parsing done    

th> f
{
  1 : 
    {
      1 : "col1"
      2 : "col2"
      3 : "col3"
    }
  2 : 
    {
      1 : "4"
      2 : "5"
      3 : "6"
    }
  3 : 
    {
      1 : "7"
      2 : "8"
      3 : "9"
    }
}

Still not a tensor, though. Most probably there’s a simple way to get a tensor from this, but at the moment we don’t know it.
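Conceptually, the missing step is mechanical: drop the header row and turn the strings into numbers. Here is a sketch of that step using Python’s standard library (not Torch code; `rows_to_matrix` is a hypothetical helper), applied to the contents of the sample file. It assumes every non-header cell is numeric.

```python
import csv
import io


def rows_to_matrix(rows, has_header=True):
    """Convert a list of rows of strings into a list of rows of floats."""
    body = rows[1:] if has_header else rows
    return [[float(cell) for cell in row] for row in body]


sample = "col1,col2,col3\n4,5,6\n7,8,9\n"
raw = list(csv.reader(io.StringIO(sample)))  # rows of strings, like raw mode
matrix = rows_to_matrix(raw)
print(matrix)  # [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
```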

Matlab

What about the Matlab format? You can save a .mat file from Python using scipy.io.savemat. And there’s the mattorch package. It has two shortcomings: first, it needs Matlab installed. The second is more serious: mattorch.load() caused a segmentation fault when loading a file saved by scipy.

But wait. There’s fb-mattorch from Facebook, and it doesn’t need Matlab. It needs, however, Facebook’s entire Torch stack, and we’re in no mood for installing it.

HDF5

HDF5 is a binary format that originated at NCSA and is widely used for scientific data (NASA is a prominent user). It’s popular enough. So far, so good - binary means compact. A tad sophisticated, but simple enough if one doesn’t need advanced functionality. It’s fine if you’re OK with binary.

Guys from DeepMind have provided a Torch package for reading and writing HDF5: torch-hdf5. And it comes with instructions! What’s the catch? Well, we followed the instructions in the manual and, after some confusion trying to get it to work, discovered that the thing we had installed was a different package with the same name.

This has since been cleared up, so HDF5 looks like the best bet so far.

Torch native

Torch has functions for serializing data, meaning you can save a tensor in a binary or text file. Scott Locklin has put together a shell script for converting CSV to Torch format: csv2t7.sh. Basically it slaps a header onto the rest, which is just space-separated numbers, and it works.

There’s also this more complicated set of two scripts called csv2torch, which we haven’t tried.

The big picture

For a piece of software in development for a few years now and in version 7 (after 3 and 5), Torch’s utilities for loading data aren’t impressive. In fact, to us they’re downright disappointing. They also give a glimpse of a wider picture, meaning rough edges in other places.

For example, the documentation is sparse and scattered. Installation with curl -s will just fail silently if curl’s handling of SSL certificates is not properly configured, as seems to be the default case on Ubuntu. And things like that.

Perhaps the most obvious difficulty for a newcomer is a new language (Lua) and its strange ecosystem. Lua is simple and quite similar to Python; however, you still need to learn a number of things. For example, how do you print the working directory? How do you change the working directory? Prepare for discovery.

On top of that, there’s the new tensor handling syntax. And it’s not Julia’s syntax, which is almost identical to Matlab (comparison), no sir (or ma’am).

All this is why, in our opinion, Torch is at the moment most suitable for brave adventurers.
