The following problem prevents me so far from a really flexible usage of data.table aggregations.
DT <- data.table(C1=c("a","b","b"),
sum_cols <- c("C2","C3")
#I want to apply a custom aggregation over multiple columns
### Part 1 of question ###
#but what if I want to add another aggregation, e.g. count
#this is not working as intended (creates 4 rows instead of 2 and doesnt contain sum_cols)
### Part 2 of question ###
# or another function for another set of colums and adding a prefix to keep them appart?
mean_cols <- c("C3","C4")
#intended table structure (with 2 rows again)
C1 sum_C2 sum_C3 mean_C3 mean_C4
I know I can always merge various single aggregation results by some key but I`m sure there must be a correct, flexible and easy way to do what I would like to do (especially Part 2).
The first thing to notice is that data.table's
j argument expects a list output, which can be built with
c, as mentioned in @akrun's answer. Here are two ways to do it:
set.seed(1) DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12) sum_cols <- c("C2","C3") mean_cols <- c("C3","C4") # with the development version, 1.10.1+ DT[, c( .N, sum = lapply(.SD[, ..sum_cols], sum), mean = lapply(.SD[, ..mean_cols], mean) ), by=C1] # in earlier versions DT[, c( .N, sum = lapply(.SD[, sum_cols, with=FALSE], sum), mean = lapply(.SD[, mean_cols, with=FALSE], mean) ), by=C1]
mget returns a list and
c connects elements together to make a list.
If you turn on the
verbose data.table option for these calls, you'll see a message:
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Also, you'll see that the optimized group mean and sum are not being used (see
?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.
The output are
lists, so we use
c to concatenate both the
DT[,c(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols] # C1 N C2 C3 # 1: a 4 0.288 22 # 2: b 8 0.576 56