# diff - R how can I calculate difference between rows in a data frame

Here is a simple example of my problem:

``> df <- data.frame(ID=1:10,Score=4*10:1)> dfID Score1 1 402 2 363 3 324 4 285 5 246 6 207 7 168 8 129 9 810 10 4> diff(df)Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] :non-numeric argument to binary operator``

Can anyone tell me why this error occurs?

diff wants a matrix or a vector rather than a data frame. Try

``````data.frame(diff(as.matrix(df)))
``````

Perhaps you are looking for something like this:

``````> tail(df, -1) - head(df, -1)
ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4
``````

You can subtract or add two `data.frame`s together if they are the same dimensions. So, what we are doing here is subtracting one `data.frame` that is missing the first row (`tail(df, -1)`) and one that is missing the last row (`head(df, -1)`) and subtracting them.

Because df works on vector or matrix. You can use apply to apply the function across columns like so:

`````` apply( df , 2 , diff )
ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4
``````

It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:

``````apply( df[-1] , 2 , diff )
``````

Or you could use `data.table` (not that it adds anything here I just really want to start using it!), and I am again assuming that you do not want to apply `diff` to the ID column:

``````DT <- data.table(df)
DT[ , list(ID,Score,Diff=diff(Score))  ]
ID Score Diff
1:  1    40   -4
2:  2    36   -4
3:  3    32   -4
4:  4    28   -4
5:  5    24   -4
6:  6    20   -4
7:  7    16   -4
8:  8    12   -4
9:  9     8   -4
10: 10     4   -4
``````

And thanks to @AnandaMahto an alternative syntax that gives more flexibility to choose which columns to run it on could be:

``````DT[, lapply(.SD, diff), .SDcols = 1:2]
``````

Here `.SDcols = 1:2` means you want to apply the `diff` function to columns 1 and 2. If you have 20 columns and didn't want to apply it to ID you could use `.SDcols=2:20` as an example.

I would like to show an alternative way for doing such kind of things even often I have the feeling it is not appreciated doing this in that way: using sql.

``````sqldf(paste("SELECT a.ID,a.Score"
,"      , a.Score - (SELECT b.Score"
,"                   FROM df b"
,"                   WHERE b.ID < a.ID"
,"                   ORDER BY b.ID DESC"
,"                   ) diff"
," FROM df a"
)
)
``````

The code seems complicated but it is not and it has some advantage, as you can see at the results:

``````    ID Score diff
1   1    40 <NA>
2   2    36 -4.0
3   3    32 -4.0
4   4    28 -4.0
5   5    24 -4.0
6   6    20 -4.0
7   7    16 -4.0
8   8    12 -4.0
9   9     8 -4.0
10 10     4 -4.0
``````

One advantage is that you use the original dataframe (without converting into other classes) and you get a data frame (put it in res <- ....). Another advantage is that you have still all rows. And the third advantage is that you can easily consider grouping factors. For example:

``````df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)

sqldf(paste("SELECT a.ID,a.grp,a.Score"
,"      , a.Score - (SELECT b.Score"
,"                   FROM df2 b"
,"                   WHERE b.ID < a.ID"
,"                         AND a.grp = b.grp"
,"                   ORDER BY b.ID DESC"
,"                   ) diff"
," FROM df2 a"
)
)

ID grp Score diff
1   1   v    40 <NA>
2   2   v    36 -4.0
3   3   v    32 -4.0
4   4   v    28 -4.0
5   5   v    24 -4.0
6   6   w    20 <NA>
7   7   w    16 -4.0
8   8   w    12 -4.0
9   9   w     8 -4.0
10 10   w     4 -4.0
``````

Adding this a few years later for completeness- you can use a simple `[.data.frame` subseting in order to achieve this too

``````df[-1, ] - df[-nrow(df), ]
#    ID Score
# 2   1    -4
# 3   1    -4
# 4   1    -4
# 5   1    -4
# 6   1    -4
# 7   1    -4
# 8   1    -4
# 9   1    -4
# 10  1    -4
``````

Another option using `dplyr` would be using `mutate_each` to loop through all the columns, get the difference of the column (`.`) with the `lag` of the column (`.`) and remove the NA element at the top with `na.omit()`

``````library(dplyr)
df %>%
mutate_each(funs(. - lag(.))) %>%
na.omit()
``````

Or with `shift` from `data.table`. Convert the 'data.frame' to 'data.table' (`setDT(df)`), loop through the columns (`lapply(.SD, ..`)`) and get the difference between the column (`x`) and the`lag`(`shift`by default gives the`lag`as`type = "lag"`). Remove the first observation i.e. NA element.

``````library(data.table)
setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]
``````