Machine learning made easy

Introducing phraug

Recently we proposed to pre-process large files line by line. Now it’s time to introduce phraug*, a set of Python scripts based on this idea. The scripts mostly deal with format conversion (CSV, libsvm, VW) and with a few other tasks common in machine learning.

With phraug you can currently convert from one format to another:

  • csv to libsvm
  • csv to Vowpal Wabbit
  • libsvm to csv
  • libsvm to Vowpal Wabbit
  • tsv to csv
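
To make the conversion idea concrete, here is a minimal sketch of the csv to libsvm case. It is not phraug's actual code: the names row_to_libsvm and csv_to_libsvm are hypothetical, and the sketch assumes the label sits in the first column, there is no header line, and all values are numeric. Zero-valued features are dropped, since libsvm is a sparse format.

```python
import csv
import io


def row_to_libsvm(row, label_index=0):
    """Convert one parsed CSV row to a libsvm line (hypothetical helper)."""
    label = row[label_index]
    features = [v for i, v in enumerate(row) if i != label_index]
    parts = [label]
    # libsvm feature indices start at 1; empty and zero values are omitted.
    for index, value in enumerate(features, start=1):
        if value.strip() == "" or float(value) == 0.0:
            continue
        parts.append(f"{index}:{value}")
    return " ".join(parts)


def csv_to_libsvm(source, destination, label_index=0):
    """Read CSV from `source` row by row and write libsvm to `destination`."""
    for row in csv.reader(source):
        if not row:
            continue  # skip blank lines
        destination.write(row_to_libsvm(row, label_index) + "\n")


# Small in-memory demo; with real files, pass open file objects instead.
demo_in = io.StringIO("1,0,2.5,3\n0,4,0,0\n")
demo_out = io.StringIO()
csv_to_libsvm(demo_in, demo_out)
print(demo_out.getvalue(), end="")
# 1 2:2.5 3:3
# 0 1:4
```

Only one row is held in memory at a time, so the same loop works for files that are too large to load at once.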

And perform some other file operations:

  • count lines in a file
  • sample lines from a file
  • split a file into two randomly
  • split a file into a number of similarly sized chunks
  • save a contiguous subset of lines from a file (for example, the first 100)
  • delete specified columns from a csv file
  • normalize (shift and scale) columns in a csv file
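
As an illustration of these operations, here is a rough sketch of splitting a file into two randomly. It is written under our own assumptions and is not necessarily how the phraug script does it: the function name split_randomly and its probability and seed parameters are hypothetical. Each line goes to the first output with the given probability and to the second otherwise, so the split sizes are approximate rather than exact.

```python
import io
import random


def split_randomly(source, first, second, probability=0.9, seed=None):
    """Send each line to `first` with the given probability, else to `second`.

    Hypothetical helper: returns the number of lines written to each output.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first_count = 0
    second_count = 0
    for line in source:
        if rng.random() < probability:
            first.write(line)
            first_count += 1
        else:
            second.write(line)
            second_count += 1
    return first_count, second_count


# Small in-memory demo; with real files, pass open file objects instead.
lines = io.StringIO("".join(f"line {i}\n" for i in range(1000)))
train, test = io.StringIO(), io.StringIO()
print(split_randomly(lines, train, test, probability=0.9, seed=42))
```

The input is read once and never modified, which matches the rule that an input file always stays unchanged.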

Basically, there’s always at least one input file and usually one or more output files. An input file always stays unchanged.

If you’re familiar with Unix, you may notice that some of these tasks are easily done with command line utilities. For example, you can count lines with wc -l or see the beginning of a file with head. Moreover, tools like sed and awk allow for more complicated operations.

Windows has a few such tools (for example, more lets you preview large files), but most of this functionality is missing. A good remedy is to install Cygwin, which provides the important Unix command line tools, including the bash shell.

Still, for things like format conversion, Python scripting is a good choice. One reason is that you can easily modify the scripts to suit your needs. We found that each project differs slightly in its pre-processing requirements, so we usually tweak an existing script or write a new one based on the basic template described in Processing large files, line by line.
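
For readers who haven't seen that post, the general shape of such a script might look like the sketch below. The actual template may differ; process_rows and drop_columns are hypothetical names chosen for this example. The idea is that the read-transform-write loop stays the same and only the transform function changes from project to project. Here the transform deletes specified columns from a csv file.

```python
import csv
import io


def process_rows(source, destination, transform):
    """Read CSV rows one at a time, transform each, and write the result.

    `transform` takes a list of strings and returns a new list,
    or None to drop the row entirely.
    """
    reader = csv.reader(source)
    writer = csv.writer(destination, lineterminator="\n")
    for row in reader:
        new_row = transform(row)
        if new_row is not None:
            writer.writerow(new_row)


def drop_columns(indices):
    """Build a transform that removes the columns at the given 0-based indices."""
    skip = set(indices)
    return lambda row: [v for i, v in enumerate(row) if i not in skip]


# Small in-memory demo; with real files, open them with newline=""
# as the csv module documentation recommends.
demo_in = io.StringIO("id,name,score\n1,ann,0.5\n2,bob,0.7\n")
demo_out = io.StringIO()
process_rows(demo_in, demo_out, drop_columns([1]))
print(demo_out.getvalue(), end="")
# id,score
# 1,0.5
# 2,0.7
```

Swapping drop_columns for another function is then enough to get a different pre-processing script without touching the loop.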

The files and usage information are available on GitHub.

*The name comes from a great book, Made to Stick, by Chip and Dan Heath.