R: split-group-apply-combine

If aaply, ddply, sapply and all the functions that look like these confuse you, this post is for you.

First make sure that you have the plyr package working in R. You can verify this by typing:

library(plyr)

Now open the CO2 dataset, which ships with R. Calling data() lists the available datasets, and typing CO2 prints it.

data()
CO2

Have a quick look at the dataset. We are interested in finding the average uptake per plant, type and treatment. We could do this the long way with a few for loops and the subset function, but that is not elegant. Instead just go ahead and type:

# df is the block of rows belonging to one Plant/Type/Treatment group
ddply(CO2, c('Plant', 'Type', 'Treatment'), function(df) mean(df$uptake))

You should see the following output:

   Plant        Type  Treatment       V1
1    Qn1      Quebec nonchilled 33.22857
2    Qn2      Quebec nonchilled 35.15714
3    Qn3      Quebec nonchilled 37.61429
4    Qc1      Quebec    chilled 29.97143
5    Qc3      Quebec    chilled 32.58571
6    Qc2      Quebec    chilled 32.70000
7    Mn3 Mississippi nonchilled 24.11429
8    Mn2 Mississippi nonchilled 27.34286
9    Mn1 Mississippi nonchilled 26.40000
10   Mc2 Mississippi    chilled 12.14286
11   Mc3 Mississippi    chilled 17.30000
12   Mc1 Mississippi    chilled 18.00000
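
For comparison, the "long way" with a for loop and subset might look roughly like the sketch below. It groups by Plant only to stay short, and the names plants, result and one_plant are just illustrative choices, not anything plyr requires.

```r
# Uses only base R and the built-in CO2 dataset.
plants <- unique(as.character(CO2$Plant))
result <- data.frame(Plant = plants, V1 = NA_real_)

for (i in seq_along(plants)) {
  # Keep only the rows that belong to one plant
  one_plant <- subset(CO2, as.character(Plant) == plants[i])
  # Store that plant's average uptake
  result$V1[i] <- mean(one_plant$uptake)
}

result
```

The loop has to build the result container, pick out each group and fill in the values by hand, which is the bookkeeping that ddply takes care of in a single call.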

Before I explain exactly how this function works, also try the following:

ddply(CO2, c('Plant', 'Type'), function(df) mean(df$uptake))

You should now see the following output:

   Plant        Type       V1
1    Qn1      Quebec 33.22857
2    Qn2      Quebec 35.15714
3    Qn3      Quebec 37.61429
4    Qc1      Quebec 29.97143
5    Qc3      Quebec 32.58571
6    Qc2      Quebec 32.70000
7    Mn3 Mississippi 24.11429
8    Mn2 Mississippi 27.34286
9    Mn1 Mississippi 26.40000
10   Mc2 Mississippi 12.14286
11   Mc3 Mississippi 17.30000
12   Mc1 Mississippi 18.00000

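Notice the difference in both the function call and the output: dropping 'Treatment' from the grouping variables drops that column from the result. The means stay the same because each plant belongs to exactly one type and one treatment. Now also try the following: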

ddply(CO2, c('Plant', 'Type', 'Treatment'), function(df) length(df$uptake))

ddply(CO2, c('Plant', 'Type', 'Treatment'), function(df)
c(length(df$uptake), mean(df$uptake)))
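
The two-value call above returns its results in columns called V1 and V2. If you would rather have descriptive column names, one option is to pass plyr's summarise function to ddply instead of an anonymous function. The sketch below does that; the column names n and mean_uptake are my own illustrative choices.

```r
library(plyr)

# summarise evaluates each named expression once per group,
# so the result columns carry the names given here instead of V1, V2.
stats <- ddply(CO2, c("Plant", "Type", "Treatment"), summarise,
               n = length(uptake),
               mean_uptake = mean(uptake))

head(stats)
```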

At this stage I don’t think I need to explain how these functions work anymore; you’ve seen it yourself.

One thing to note about this package, though. If the input is a data frame and the output should be a data frame, you use ddply. If the input is an array and you want the output to be a list, you use alply. The first two letters of each plyr function name indicate the types of data going in and coming out: the first letter is the input type and the second is the output type (a for array, d for data frame, l for list).
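
To make the naming scheme concrete, here is a small sketch that passes the same numbers through three differently named plyr functions. The variable names by_type, as_array and as_df are just illustrative.

```r
library(plyr)

# d -> l: data frame in, list out (one element per Type)
by_type <- dlply(CO2, "Type", function(df) mean(df$uptake))

# l -> a: list in, array out (here a simple numeric vector)
as_array <- laply(by_type, function(x) round(x, 2))

# l -> d: list in, data frame out
as_df <- ldply(by_type, function(x) data.frame(mean_uptake = x))

class(by_type)
class(as_array)
class(as_df)
```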

Using the plyr package this way can save you an INSANE amount of time. I highly recommend it; kudos to the creator.

=======
UPDATE!

Due to a recent discussion on Stack Overflow, I feel obliged to also note that there are alternative ways to do operations like this that are less computationally expensive. You may alternatively want to try:

aggregate(uptake ~ Type + Plant + Treatment, data = CO2, FUN = mean)
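
If you want to convince yourself that the base R route gives the same numbers as ddply, a quick check along these lines should do. It computes the per-group mean both ways and merges the two results on the grouping columns; via_plyr, via_base and both are illustrative names.

```r
library(plyr)

# Per-group mean with plyr (the result column is named V1)
via_plyr <- ddply(CO2, c("Plant", "Type", "Treatment"),
                  function(df) mean(df$uptake))

# Per-group mean with base R (the result column is named uptake)
via_base <- aggregate(uptake ~ Plant + Type + Treatment,
                      data = CO2, FUN = mean)

# Line the two results up by group and compare the values
both <- merge(via_plyr, via_base, by = c("Plant", "Type", "Treatment"))
all.equal(both$V1, both$uptake)
```

The row order of the two results can differ, which is why the comparison goes through merge rather than comparing the columns directly.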