Introducing phraug

Recently we proposed pre-processing large files line by line. Now it’s time to introduce phraug*, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning.

With phraug you can currently convert from one format to another:

csv to libsvm
csv to Vowpal Wabbit
libsvm to csv
libsvm to Vowpal Wabbit
tsv to csv
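
To give a feel for what such a conversion involves, here is a minimal sketch of a csv to libsvm converter. It is not phraug's actual code: the function name `csv_to_libsvm` and its parameters `label_index` and `skip_header` are our own assumptions for illustration, and it assumes every column is numeric. It reads one row at a time, so memory use stays constant regardless of file size.

```python
import csv


def csv_to_libsvm(input_path, output_path, label_index=0, skip_header=False):
    """Convert a numeric CSV file to libsvm format, one line at a time.

    Hypothetical helper for illustration; not part of phraug.
    """
    with open(input_path, newline="") as src, open(output_path, "w") as dst:
        reader = csv.reader(src)
        if skip_header:
            next(reader, None)  # discard the header row if there is one
        for row in reader:
            if not row:
                continue  # skip empty lines
            label = row[label_index]
            features = row[:label_index] + row[label_index + 1:]
            # libsvm is a sparse format: indices start at 1, zeros are omitted
            pairs = [
                f"{i}:{value}"
                for i, value in enumerate(features, start=1)
                if value.strip() and float(value) != 0.0
            ]
            dst.write(" ".join([label] + pairs) + "\n")
```

Calling `csv_to_libsvm("train.csv", "train.libsvm")` would write one libsvm line per CSV row, with the first column taken as the label. The file names here are placeholders.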

You can also perform some other file operations:

count lines in a file
sample lines from a file
split a file into two randomly
split a file into a number of similarly sized chunks
save a contiguous subset of lines from a file (for example, the first 100)
delete specified columns from a csv file
normalize (shift and scale) columns in a csv file
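
As an illustration of one of these operations, splitting a file into two randomly can be sketched as follows. This is an assumption about how one might do it, not phraug's implementation: the name `split_file`, the probability parameter `p`, and the `seed` parameter are hypothetical. Each line goes to the first output with probability `p`, so the split sizes are approximate rather than exact.

```python
import random


def split_file(input_path, output_path_1, output_path_2, p=0.9, seed=None):
    """Randomly split a file into two, line by line.

    Hypothetical helper for illustration; not part of phraug.
    Returns the number of lines written to each output.
    """
    rng = random.Random(seed)  # a seed makes the split reproducible
    counts = [0, 0]
    with open(input_path) as src, \
            open(output_path_1, "w") as out1, \
            open(output_path_2, "w") as out2:
        for line in src:
            # each line lands in the first file with probability p
            if rng.random() < p:
                out1.write(line)
                counts[0] += 1
            else:
                out2.write(line)
                counts[1] += 1
    return tuple(counts)
```

The input file is only opened for reading, so it stays unchanged, and only one line is held in memory at a time.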

Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged.

If you’re familiar with Unix, you may notice that some of these tasks are easily achieved using command line utilities. For example, you can count lines with wc -l or see the beginning of a file with head. Moreover, there are tools like sed and awk that allow for more complicated operations.

Windows has a few such tools (for example, more lets you preview large files), but most of this functionality is lacking. A good remedy is to install Cygwin, which provides all the important command line tools from Unix, including the bash shell.

Still, for things like format conversion, Python scripting is a good choice. One reason is that you can easily modify the scripts to suit your needs. We found that each project differs slightly in its pre-processing requirements, so we usually tweak an existing script or write a new one based on the basic template described in "Processing large files, line by line".
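
To show why such scripts are easy to adapt, here is a rough sketch of a line-by-line template with the per-row logic pulled out into a function. It is our own illustration, not the template from the earlier post: `process_csv` and `drop_columns` are hypothetical names. Changing what the script does then means swapping in a different `transform`.

```python
import csv


def process_csv(input_path, output_path, transform):
    """Read a CSV row by row, apply transform, and write the result.

    Hypothetical template for illustration; not part of phraug.
    A transform that returns None drops the row.
    """
    with open(input_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        for row in reader:
            new_row = transform(row)
            if new_row is not None:
                writer.writerow(new_row)


def drop_columns(indices):
    """Build a transform that deletes the columns at the given 0-based indices."""
    to_drop = set(indices)
    return lambda row: [v for i, v in enumerate(row) if i not in to_drop]
```

For example, `process_csv("data.csv", "data_reduced.csv", drop_columns([1]))` would copy the file without its second column. The file names are placeholders.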

The files and usage information are available on GitHub.

*The name comes from a great book, Made to Stick, by Chip and Dan Heath.
