FastML

Machine learning made easy

Piping in R and in Pandas

In R community, there’s this one guy, Hadley Wickam, who by himself made R great again. One of the many, many things he came up with - so many they call it a hadleyverse - is the dplyr package, which aims to make data analysis easy and fast. It works by allowing a user to take a data frame and apply to it a pipeline of operations resulting in a desired outcome (an example in just a minute). This approach turned out to be successful. Then people have ported key pieces to Pandas.

There’s an interesting story about how Hadley invented all those things. It goes like this. An angel - some say it was a daemon, but don’t believe them, it was an angel - visited our hero in a dream and said: ”I will give you some ideas that will make you rich and famous - well, rich intellectually and famous in the R community - but there’s a catch. For reasons I won’t disclose, for a mortal like you wouldn’t really understand, you must make the piping operator as bad as you possibly can, but without rendering it outright ridiculous. No more than three chars, and remember - as ugly and hard to type as you can.

Hadley agreed and woke up. Being a smart guy that he is, in the morning he constructed an appropriate bundle of characters. Reportedly, the thought process unfolded as follows:

Okay, three characters. Let’s invert that old <-, and elongate it: -->. Meh, waaay too pretty and you type it all with one hand.

I know… ~~> is good. ~>~ even better. One needs to press tilde key twice, then delete one, go to >, repeat, all with Shift pressed. Sweet. But dang, too pretty. I need ugly.

Think, man, think! Let’s see. #>#. Yeah. Nah, that looks half-reasonable. Wait… wait… yes… %>%. That’s it!


Image credit: Dexter’s Laboratory

And the rest is history:

carriers_db2 %>% summarise(delay = mean(arr_delay)) %>% collect()


Image credit: Natalie Cooper

But seriously, the man in question says that’s all because infix operators in R must have the form %something%. By the way, Hadley has at least two books online: Advanced R and R for Data Science.

In Pandas

We know of three modules for piping in Pandas: pandas-ply, dplython and dfply. All three use reasonable piping operators.

pandas-ply, from Coursera, is the simplest of them and closest to the Pandas spirit. It uses a normal dot for chaining and just adds a few methods to the DataFrame. Here’s their motivating example, adapted from the dplyr intro:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

# instead:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

Less typing and no need for intermediate artifacts. Notice how you refer to the transformed dataframe inside the pipeline by X.

dplython is closer to dplyr. The module provides verbs (functions) similar to the R counterpart, but the pipeline operator is a handsome >>, and there’s this nice diamonds dataset:

(diamonds >> 
  sample_n(10) >> 
  arrange(X.carat) >> 
  select(X.carat, X.cut, X.depth, X.price))

(diamonds >> 
  mutate(carat_bin=X.carat.round()) >> 
  group_by(X.cut, X.carat_bin) >> 
  summarize(avg_price=X.price.mean()))  

What’s with the outer parens? Is this Lisp or something?

Let us mention that in R, all functions are pipable. In Python, you need to make them pipable. dlpython has a special decorator for it, @DelayFunction.

Finally, there is dfply, inspired by dlpython, but with even more functions. It appears less mature than the previous two - pip install won’t work here. The example shows some means of deleting columns from a frame:

diamonds >> drop_endswith('e','y','z') >> head(2)

What these modules provide is mostly syntactic sugar, and using them depends on a personal taste. For example, while the pandas-ply flights example above is convincing, is one of these lines better than the others?

diamonds.ply_where(X.carat > 4).ply_select('carat', 'cut', 'depth', 'price')

diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)

diamonds[diamonds.carat > 4]['carat', 'cut', 'depth', 'price']

We’d like automatic X in pandas:

diamonds[X.carat > 4][X.carat, X.cut, X.depth, X.price]

P.S.

Apparently it was Stefan Milton Bache who invented the piping operator in the package magrittr. He’s Danish; they have the most complicated and difficult language in Europe, except Hungarians. Danish don’t mind such trivial inconveniences as %>%. By the way, there’s more: %T>%, %$%, %<>%.

Comments