Perhaps the most common data format for machine learning is text files. Often the data is too large to fit in memory; this is sometimes referred to as big data. But do you need to load the whole thing into memory? Maybe you could at least pre-process it line by line. We show how to do this with Python. Prepare to read and possibly write some code.
The most common format for text files is probably CSV. For sparse data, the libsvm format is popular. Both can be processed using the csv module in Python:
import csv
i_f = open( input_file, 'r' )
reader = csv.reader( i_f )
For libsvm, you just set the delimiter to a space:
reader = csv.reader( i_f, delimiter = ' ' )
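To illustrate with made-up values, a line in libsvm format looks something like this:

1 5:1 12:0.4 101:1

With the space delimiter, the reader turns it into [ '1', '5:1', '12:0.4', '101:1' ]: the label first, then index:value pairs for the non-zero features.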
Then you go over the file contents. Each line is a list of strings:
# a csv.writer for an output file; 'wb' avoids extra blank lines on Windows
o_f = open( output_file, 'wb' )
writer = csv.writer( o_f )

for line in reader:
    # do something with the line, for example:
    label = float( line[0] )
    # ....
    writer.writerow( line )
If you need to do a second pass, you just rewind the input file:
i_f.seek( 0 )
for line in reader:
    # more stuff
split.py
We’ll demonstrate the idea using the split.py script as an example. You might have seen it already. Its purpose is to randomly split a file in two, so that some lines from the original file go to the first output file and the rest go to the second. This is useful for creating training and test sets for validation.
We’d like to specify input and output files on the command line, like this:
python split.py train.csv train_v.csv test_v.csv
We’d also like to specify the probability of writing to the first file, so that, for example, 90% of the lines go to the training set and the rest to the test set:
python split.py train.csv train_v.csv test_v.csv 0.9
Let’s get to it. We need to import a few modules and read the file names:
import csv
import sys
import random
input_file = sys.argv[1]
output_file1 = sys.argv[2]
output_file2 = sys.argv[3]
If the user doesn’t specify P, it defaults to 0.9:
try:
    P = float( sys.argv[4] )
except IndexError:
    P = 0.9
Random seed
It might be useful to be able to split the file again in the future in exactly the same way. Let’s say you have some data and split it. Then you convert the data to some other format and need to split it the same way again, so that you can compare results achieved with the different formats but the same split. Otherwise you might be comparing apples to oranges, and that’s something we probably don’t want.
To get the same split every time, we’ll seed the random number generator. You give it a seed - an arbitrary string - and it will then behave exactly the same way every run. That’s because it’s not really random, it’s pseudorandom. So we’d like to be able to specify a seed on the command line as the final argument:
python split.py train.csv train_v.csv test_v.csv 0.9 a_random_seed
try:
    seed = sys.argv[5]
except IndexError:
    seed = None

if seed:
    random.seed( seed )
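If you’re curious, you can see the effect in an interpreter (just a sketch; the seed string is arbitrary):
random.seed( 'a_random_seed' )
print random.random()    # prints the same number on every run
print random.random()    # as does each subsequent call, in order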
Readers and writers
Let’s open the files and create a CSV reader and two writers. If you’re on Windows, it’s important to open the files for writing in binary mode ('wb'), otherwise you might get some extra blank lines.
i = open( input_file )
o1 = open( output_file1, 'wb' )
o2 = open( output_file2, 'wb' )
reader = csv.reader( i )
writer1 = csv.writer( o1 )
writer2 = csv.writer( o2 )
By the way, in this example we aren’t touching the contents of a line, so we don’t really need the csv module. Something like this would be enough:
i = open( input_file )
o = open( output_file, 'wb' )

for line in i:
    o.write( line )
The headers
Some files have headers in the first line. If that’s the case with your data, you can read the first line and write it to both output files:
headers = reader.next()
writer1.writerow( headers )
writer2.writerow( headers )
Or maybe discard it.
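Discarding is even simpler: read the header line and don’t write it anywhere:
reader.next()    # read the first line and throw it away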
The loop
After all this preparation we are ready for the final step: the loop. We read a line and then (pseudo)randomly decide whether to write it to one file or the other. The random() function in the random module returns a floating point number in the range [0.0, 1.0). We compare this number to P and on this basis decide which file to write to.
for line in reader:
    r = random.random()
    if r > P:
        writer2.writerow( line )
    else:
        writer1.writerow( line )
Note that this is an inexact method of splitting: if you have a thousand lines and P = 0.9, you’ll get roughly 900 lines in the first file and roughly 100 in the second.
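If you want to verify the actual sizes, a quick count will do (a throwaway sketch; close the output files first so the writers flush):
o1.close()
o2.close()
print sum( 1 for line in open( output_file1 ) )    # roughly n * P lines
print sum( 1 for line in open( output_file2 ) )    # roughly n * ( 1 - P ) lines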
An exact split
If you wanted to split the file exactly 900/100, here’s a way to do it. You shuffle integers from 0 to 999 and take the first 100.
n = 1000
indexes = random.sample( xrange( n ), n )
indexes = indexes[:100]
xrange() is just like range(), only more memory-efficient: instead of building the whole list up front, it yields the numbers one at a time. Basically it will give you integers from 0 to n-1, which we sample without replacement. In other words, we shuffle the list and divide it into two parts.
Then you do a second pass over the file. If a line’s number is among the selected 100, you write it to the second file; if not, to the first.
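Here’s a sketch of that second pass, reusing the reader and writers from before. The name test_indexes is just for illustration, and n would come from counting lines during the first pass; converting the indexes to a set makes the membership test fast:
n = 1000    # total number of lines, counted during the first pass
indexes = random.sample( xrange( n ), n )    # a random permutation of 0..n-1
test_indexes = set( indexes[:100] )    # line numbers destined for the second file

i.seek( 0 )    # rewind the input for the second pass
for line_number, line in enumerate( reader ):
    if line_number in test_indexes:
        writer2.writerow( line )
    else:
        writer1.writerow( line )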
Epilogue
We use the described template extensively. It is usually quite fast, needs little memory, and can deal with files larger than available memory, unlike R or Matlab. If you want more examples, look through our GitHub repos and be sure to check out phraug.