当前位置: 动力学知识库 > 问答 > 编程问答 >

diff - R how can I calculate difference between rows in a data frame

问题描述:

Here is a simple example of my problem:

> df <- data.frame(ID=1:10,Score=4*10:1)

> df

ID Score

1 1 40

2 2 36

3 3 32

4 4 28

5 5 24

6 6 20

7 7 16

8 8 12

9 9 8

10 10 4

> diff(df)

Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] :

non-numeric argument to binary operator

Can anyone tell me why this error occurs?

网友答案:

diff wants a matrix or a vector rather than a data frame. Try

data.frame(diff(as.matrix(df)))
网友答案:

Perhaps you are looking for something like this:

> tail(df, -1) - head(df, -1)
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

You can subtract or add two data.frames together if they are the same dimensions. So, what we are doing here is subtracting one data.frame that is missing the first row (tail(df, -1)) and one that is missing the last row (head(df, -1)) and subtracting them.

网友答案:

Because df works on vector or matrix. You can use apply to apply the function across columns like so:

 apply( df , 2 , diff )
   ID Score
2   1    -4
3   1    -4
4   1    -4
5   1    -4
6   1    -4
7   1    -4
8   1    -4
9   1    -4
10  1    -4

It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:

apply( df[-1] , 2 , diff )

Or you could use data.table (not that it adds anything here I just really want to start using it!), and I am again assuming that you do not want to apply diff to the ID column:

DT <- data.table(df)
DT[ , list(ID,Score,Diff=diff(Score))  ]
    ID Score Diff
 1:  1    40   -4
 2:  2    36   -4
 3:  3    32   -4
 4:  4    28   -4
 5:  5    24   -4
 6:  6    20   -4
 7:  7    16   -4
 8:  8    12   -4
 9:  9     8   -4
10: 10     4   -4

And thanks to @AnandaMahto an alternative syntax that gives more flexibility to choose which columns to run it on could be:

DT[, lapply(.SD, diff), .SDcols = 1:2]

Here .SDcols = 1:2 means you want to apply the diff function to columns 1 and 2. If you have 20 columns and didn't want to apply it to ID you could use .SDcols=2:20 as an example.

网友答案:

I would like to show an alternative way for doing such kind of things even often I have the feeling it is not appreciated doing this in that way: using sql.

sqldf(paste("SELECT a.ID,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df b"
            ,"                   WHERE b.ID < a.ID"
            ,"                   ORDER BY b.ID DESC"
            ,"                   ) diff"
            ," FROM df a"
            )
      )

The code seems complicated but it is not and it has some advantage, as you can see at the results:

    ID Score diff
 1   1    40 <NA>
 2   2    36 -4.0
 3   3    32 -4.0
 4   4    28 -4.0
 5   5    24 -4.0
 6   6    20 -4.0
 7   7    16 -4.0
 8   8    12 -4.0
 9   9     8 -4.0
 10 10     4 -4.0

One advantage is that you use the original dataframe (without converting into other classes) and you get a data frame (put it in res <- ....). Another advantage is that you have still all rows. And the third advantage is that you can easily consider grouping factors. For example:

df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)

sqldf(paste("SELECT a.ID,a.grp,a.Score"
            ,"      , a.Score - (SELECT b.Score"
            ,"                   FROM df2 b"
            ,"                   WHERE b.ID < a.ID"
            ,"                         AND a.grp = b.grp"
            ,"                   ORDER BY b.ID DESC"
            ,"                   ) diff"
     ," FROM df2 a"
     )
)


   ID grp Score diff
1   1   v    40 <NA>
2   2   v    36 -4.0
3   3   v    32 -4.0
4   4   v    28 -4.0
5   5   v    24 -4.0
6   6   w    20 <NA>
7   7   w    16 -4.0
8   8   w    12 -4.0
9   9   w     8 -4.0
10 10   w     4 -4.0
网友答案:

Adding this a few years later for completeness- you can use a simple [.data.frame subseting in order to achieve this too

df[-1, ] - df[-nrow(df), ]
#    ID Score
# 2   1    -4
# 3   1    -4
# 4   1    -4
# 5   1    -4
# 6   1    -4
# 7   1    -4
# 8   1    -4
# 9   1    -4
# 10  1    -4
网友答案:

Another option using dplyr would be using mutate_each to loop through all the columns, get the difference of the column (.) with the lag of the column (.) and remove the NA element at the top with na.omit()

library(dplyr)
df %>%
    mutate_each(funs(. - lag(.))) %>%
    na.omit() 

Or with shift from data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), loop through the columns (lapply(.SD, ..)) and get the difference between the column (x) and thelag(shiftby default gives thelagastype = "lag"`). Remove the first observation i.e. NA element.

library(data.table)
setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]
分享给朋友:
您可能感兴趣的文章:
随机阅读: