
R: how to join the duplicate rows in one dataframe

Question:

I have a data frame with some duplicated rows, and I want to merge only those duplicated rows. An example is given below:

  name  b  c  d
1   yp  3 NA NA
2   yp  3  1 NA
3   IG NA  3 NA
4   OG  4  1  0

Duplicated rows are those that share the same name. In this example, rows 1 and 2 need to be merged, with each NA replaced by the numerical value from the other row where one exists. The desired result is:

  name  b  c  d
1   yp  3  1 NA
2   IG NA  3 NA
3   OG  4  1  0

Assumption: if two rows have the same name and neither has NA in a given column, then their values in that column are the same number.
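To make the example reproducible, here is a minimal base R sketch that builds the data frame and checks the stated assumption. The object name `df`, the integer column types, and the helper `consistent()` are assumptions made for illustration; they are not given in the question.

```r
# Example data from the question; the name `df` and integer types are assumptions.
df <- data.frame(
  name = c("yp", "yp", "IG", "OG"),
  b = c(3L, 3L, NA, 4L),
  c = c(NA, 1L, 3L, 1L),
  d = c(NA, NA, NA, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if a column holds at most one distinct non-NA value.
consistent <- function(x) length(unique(x[!is.na(x)])) <= 1

# Split the value columns by name and check every column within each group.
ok <- sapply(split(df[-1], df$name), function(g) all(sapply(g, consistent)))
ok
```

If every element of `ok` is TRUE, picking any non-NA value per group and column is safe, which is what the answers below rely on.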

Answer:

Here's a dplyr approach:

library(dplyr)
df %>% group_by(name) %>% summarise_each(funs(first(.[!is.na(.)])))
#Source: local data frame [3 x 4]
#
#    name     b     c     d
#  (fctr) (int) (int) (int)
#1     IG    NA     3    NA
#2     OG     4     1     0
#3     yp     3     1    NA

This groups the data by "name" and returns a single row per unique name. In each of the other columns it returns the first value that is not NA, or NA if all entries are NA. This is in line with the assumption that if several numerical values are present, they must all be the same (and hence we can pick the first one).
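Note that `summarise_each()` and `funs()` have since been deprecated in dplyr. A sketch of the same idea using `across()`, which requires dplyr 1.0.0 or later, might look like this. The data frame `df` is rebuilt here from the question's example so that the snippet is self-contained, and the result name `merged` is an assumption.

```r
library(dplyr)

# Example data from the question (object name `df` is an assumption).
df <- data.frame(
  name = c("yp", "yp", "IG", "OG"),
  b = c(3L, 3L, NA, 4L),
  c = c(NA, 1L, 3L, 1L),
  d = c(NA, NA, NA, 0L),
  stringsAsFactors = FALSE
)

# For each name, keep the first non-NA value of every other column.
# first() on an empty vector yields NA, which covers the all-NA case.
merged <- df %>%
  group_by(name) %>%
  summarise(across(everything(), ~ first(.x[!is.na(.x)])))

merged
```

Inside a grouped `summarise()`, `everything()` does not select the grouping column, so only `b`, `c` and `d` are summarised.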

Answer:

Perhaps you can try something like the following:

library(data.table)

# setDT() converts mydf to a data.table by reference
setDT(mydf)[, lapply(.SD, function(x) {
  if (all(is.na(x))) NA else x[!is.na(x)][1]
}), by = name]
#    name  b c  d
# 1:   yp  3 1 NA
# 2:   IG NA 3 NA
# 3:   OG  4 1  0

Basically, if all values in a group are NA, return NA; otherwise, take the first non-NA value.


As pointed out by @docendodiscimus, this can be simplified to:

setDT(mydf)[, lapply(.SD, function(x) x[!is.na(x)][1]), by = name]
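
The simplification works because indexing an empty vector with `[1]` returns NA of the same type, so the all-NA case needs no separate branch. A small base R sketch (the helper name `first_non_na` is made up for illustration):

```r
# Hypothetical helper wrapping the expression used in the simplified answer.
first_non_na <- function(x) x[!is.na(x)][1]

all_na <- c(NA_integer_, NA_integer_)
mixed  <- c(NA, 3L, 3L)

# After filtering, all_na becomes integer(0); integer(0)[1] is NA_integer_.
first_non_na(all_na)

# After filtering, mixed becomes c(3L, 3L); the first element is 3L.
first_non_na(mixed)
```

The result stays an integer in both cases, so the column type does not change between groups.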

Answer:

A quick way to solve this is to use the dplyr package: group on the variables that identify duplicates, then decide how to combine the rows within each group. One reasonable way to combine them is to take the mean of all non-NA values. In your case the code would be:

library(dplyr)

# average the non-NA values of every other column within each name
df %>% group_by(name) %>%
       summarise_each(funs(mean(., na.rm = TRUE)))
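
One caveat with the mean-based approach: when every value in a group is NA, `mean(x, na.rm = TRUE)` returns NaN rather than NA, and integer columns come back as doubles. A base R sketch of the behaviour, with a hypothetical wrapper `mean_or_na()` that keeps NA in that case:

```r
vals_all_na <- c(NA_real_, NA_real_)

# With na.rm = TRUE nothing is left to average, so the result is NaN.
nan_result <- mean(vals_all_na, na.rm = TRUE)

# Hypothetical wrapper: return NA when there is nothing to average.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

mean_or_na(vals_all_na)   # NA instead of NaN
mean_or_na(c(3, NA, 3))   # 3
```

Under the question's assumption that non-NA values within a group are identical, the mean equals that shared value, so the wrapper gives the same numbers as the first-non-NA approaches above.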