r - Learning to understand plyr, ddply

I've been trying to understand what plyr does and how it works by trying different variables and functions and seeing what results. So I'm looking more for an explanation of how plyr works than for specific fix-it answers. I've read the documentation, but my newbie brain is still not getting it.

Some data and names:

``````
mydf <- data.frame(c("a","a","b","b","c","c"),
                   c("e","e","e","e","e","e"),
                   c(1,2,3,10,20,30),
                   c(5,10,20,20,15,10))
colnames(mydf) <- c("Model", "Class", "Length", "Speed")
mydf
``````

Question 1: Summarise versus Transform Syntax

So if I enter: `ddply(mydf, .(Model), summarise, sum = Length+Length)`

I get:

``````
  Model ..1
1     a   2
2     a   4
3     b   6
4     b  20
5     c  40
6     c  60
``````

and if I enter: `ddply(mydf, .(Model), summarise, Length+Length)` I get the same result.

Now if I use transform: `ddply(mydf, .(Model), transform, sum = (Length+Length))`

I get:

``````
  Model Class Length Speed sum
1     a     e      1     5   2
2     a     e      2    10   4
3     b     e      3    20   6
4     b     e     10    20  20
5     c     e     20    15  40
6     c     e     30    10  60
``````

But if I leave out the name, as in the second summarise call:

`ddply(mydf, .(Model), transform, (Length+Length))`

``````
  Model Class Length Speed
1     a     e      1     5
2     a     e      2    10
3     b     e      3    20
4     b     e     10    20
5     c     e     20    15
6     c     e     30    10
``````

So why does adding "sum =" make a difference?

Question 2: Why don't these work?

``````
ddply(mydf, .(Model), sum, Length+Length)
# Error in function (i) : object 'Length' not found

ddply(mydf, .(Model), length, mydf$Length)
# Error in .fun(piece, ...) :
#   2 arguments passed to 'length' which requires 1
``````

These examples are more to show that somewhere I'm fundamentally not understanding how to use plyr.

Any answers or explanations are appreciated.

The syntax is:

``````ddply(data.frame, variable(s), function, optional arguments)
``````

where the function is expected to return a `data.frame`. In your situation,

• summarise is a function that will transparently create a new data.frame, with the results of the expression that you provide as further arguments (...)

• transform, a base R function, will transform the data.frames (first split by the variable(s)), adding new columns according to the expression(s) that you provide as further arguments. These need to be named, that's just the way transform works.

If you use other functions than subset, transform, mutate, with, within, or summarise, you'll need to make sure they return a data.frame (length and sum don't), or at the very least a vector of appropriate length for the output.
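As a minimal sketch of that last point, here is one way to wrap `sum` so that each piece yields a one-row `data.frame`. The column names `total_length` and `n_rows` and the argument name `piece` are made up for illustration. As far as I understand it, the original `sum, Length+Length` call fails because the extra arguments are passed to the function as-is, so `Length` is looked up where `ddply` was called rather than inside each piece.

```r
library(plyr)

# Same data as in the question, built with named columns for brevity
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# ddply calls the function once per Model, passing that group's rows as `piece`.
# Returning a one-row data.frame gives ddply something it can stack back together.
length_totals <- ddply(mydf, .(Model), function(piece) {
  data.frame(total_length = sum(piece$Length), n_rows = nrow(piece))
})

length_totals
```

The result should have one row per `Model`, with the grouping column added back in front of the two computed columns.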

I find that when I'm having trouble "visualizing" how any of the functional tools in R work, the easiest thing to do is to call `browser()` on a single piece:

``````ddply(mydf, .(Model), function(x) browser() )
``````

Then inspect `x` in real-time and it should all make sense. You can then test out your function on x, and if it works you're golden (barring other groupings being different than your first x).
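If you'd rather not drop into an interactive session, a rough non-interactive approximation is to split the data yourself with base R's `split()` and look at one piece. This is only a sketch of the idea: plyr does its own splitting internally, so treat `pieces` as a stand-in for what the function receives as `x`, not as what plyr literally does.

```r
# Same data as in the question, built with named columns for brevity
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# One data frame per distinct Model value
pieces <- split(mydf, mydf$Model)

names(pieces)       # the groups
pieces[["b"]]       # roughly what x looks like for Model "b"

# Try a candidate function on a single piece before handing it to ddply
sum(pieces[["b"]]$Length)
```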

The way I understand it, the `ddply(... , .(...) , summarise, ...)` operations are designed to reduce the number of rows to match the number of distinct combinations of the `.(...)` grouping variables. So for your first example, this seemed natural:

ddply(mydf, .(Model), summarise, sL = sum(Length))
Model sL
1     a  3
2     b 13
3     c 50
``````

OK. Seems to work for me (not a regular plyr user). The `transform` operations on the other hand I understand to be making new columns of the same length as the dataframe. That was what your first `transform` call accomplished. Your second one (a failure) was:

``````ddply(mydf, .(Model), transform, (Length+Length))
``````

That one did not create a new name for the operation that was performed, so there was nothing new assigned in the result. When you added `sum=(Length+Length)`, there suddenly was a name available (and the `sum` function was not used). It's generally a bad idea to use the names of functions for column names.
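The same behaviour can be seen with base `transform` alone, without plyr in the picture. The sketch below assumes the stock `transform` method for data frames; `double_length` is just an illustrative column name chosen to avoid clashing with the `sum` function.

```r
# Same data as in the question, built with named columns for brevity
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# Unnamed expression: it is evaluated, but there is no name to store it under,
# so the data frame comes back unchanged
unnamed <- transform(mydf, Length + Length)

# Named expression: the result is added as a new column
named <- transform(mydf, double_length = Length + Length)

names(unnamed)
names(named)
```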

On question two, I think that the .fun argument needs to be a plyr function or something that makes sense applied to a (split) dataframe as a whole, rather than any old function. There is no `sum.data.frame` function. But `nrow` or `ncol` do make sense. You can even get `str` to work in that position. The `length` function applied to a dataframe gives the number of columns:

`````` ddply(mydf, .(Model), length )  # all 4's
``````
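For comparison, here is a small sketch that runs `nrow` and `length` side by side. Both take the whole piece as their single argument, which is why they work where `sum, Length+Length` did not. The variable names `row_counts` and `col_counts` are illustrative only.

```r
library(plyr)

# Same data as in the question, built with named columns for brevity
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# nrow() of each piece: how many rows fall under each Model
row_counts <- ddply(mydf, .(Model), nrow)

# length() of each piece: the number of columns, identical for every group
col_counts <- ddply(mydf, .(Model), length)

row_counts
col_counts
```

Each result has one row per `Model`; the computed value lands in a second column that plyr names automatically.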