当前位置: 动力学知识库 > 问答 > 编程问答 >

r - Collapse data with rules for a list of data frames

问题描述:

I have a data frame that I need to collapse based on a defined group. The data consist of several hundred groups. Each group may have anywhere from 2-5 rows. For simplicity, my example shows 3 groups with 2-4 rows.

I want to flatten the replicates within each group. For each column in a group, I want to return the maximum occurring value that is not NA. The issue lies in what to do in the case of a tie. For ties, I need to set custom rules based on the types of values that are tied. One potential desperate option would be to paste the tied values together separated by a comma and I can deal with them in find/replace fashion.

To get the maximum occurring value, I was able to use the max function. Any suggestions on how to deal with ties?

#Input Data Example

> data

Group Loc1 Loc2 Loc3 Loc4

1 Group1 A/B A/A B/B NA

2 Group1 A/B A/A B/B A/A

3 Group1 A/A A/A A/A NA

4 Group1 A/A A/A A/A NA

5 Group2 A/A NA C/C B/B

6 Group2 B/B A/A C/C B/B

7 Group2 B/B A/A C/C B/B

8 Group3 B/B B/B NA B/B

9 Group3 B/B B/B NA A/A

#Desired Collapsed Output

> data.collapsed

Group Loc1 Loc2 Loc3 Loc4

1 Group1 NA A/A A/B A/A

2 Group2 B/B A/A C/C B/B

3 Group3 B/B B/B NA A/B

FINAL Code (updated Jan 27,2015)

library(data.table)

#Data Frame

#Each group has replicates of data that need to be collapsed to make a consensus data replicate

data = rbind(c("Group1","A/B", "A/A","B/B",NA), c("Group1","A/B", "A/A","B/B","A/A"), c("Group1","A/A", "A/A","A/A",NA),

c("Group1","A/A", "A/A","A/A",NA), c("Group2","A/A", NA,"C/C","B/B"), c("Group2","B/B", "A/A","C/C","B/B"),

c("Group2","B/B", "A/A","C/C","B/B"), c("Group3","B/B", "B/B",NA,"B/B"), c("Group3","B/B", "B/B",NA,"A/A"))

colnames(data) = c("Group", "Loc1", "Loc2", "Loc3", "Loc4")

data = as.data.frame(data)

data

#Define acceptable value types; these could be used to define what to do in the case of a tie

same.letter = c("A/A","B/B","C/C")

diff.letter = c("A/B","A/C","B/C")

#Function for collapsing data with rules

RepMerge = function(col) {

z = table(col);

z.max = which(z==max(z));

ifelse(length(z.max) > 2, "NA", #if tied between more than 2 different values, report NA

ifelse(length(z.max) == 1, names(z)[z.max], #if one max value, report that value

ifelse(length(z.max) == 2 & names(z)[z.max][1] %in% same.letter & names(z)[z.max][2] %in% same.letter, paste(substring(names(z)[z.max][1],1,1),substring(names(z)[z.max][2],1,1), sep="/"), #if both max values are different but are in 'same.letter', report a combination

ifelse(length(z.max) == 2 & names(z)[z.max][1] %in% diff.letter | names(z)[z.max][2] %in% diff.letter, "NA", "Check Code")))) #if one of the max values is in diff.letter, report NA. If no cases fit the above, report "Check Code"

}

setDT(data)[,lapply(.SD,RepMerge),Group] # run function to collapse the data

Thank you,

SC2

网友答案:

Here is the solution based on data.table:

library(data.table)
setDT(data)[,lapply(.SD,function(cl) {z<-table(cl);z.max<-which(z==max(z));ifelse(length(z.max)>1,"NA",names(z)[z.max])}),Group]

#    Group Loc1 Loc2 Loc3 Loc4
#1: Group1   NA  A/A   NA   NA
#2: Group2  B/B  A/A  C/C  B/B
#3: Group3  B/B  B/B   NA   NA

By modifying ifelse, you can set the desired rules for treating ties and NAs.

PS: You were curious about why the max function ignores NAs in your code. It happens because your data table contains strings 'NA', not actual NAs.

分享给朋友:
您可能感兴趣的文章:
随机阅读: