当前位置: 动力学知识库 > 问答 > 编程问答 >

r - From which row a data.frame variable have a constant value

问题描述:

I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

s<-"no Spc PSize

2 0 6493

2 0 9281

2 12 26183

2 12 36180

2 12 37806

2 12 37765

3 12 36015

3 12 26661

3 0 14031

3 0 5564

3 1 17701

3 1 20808

3 1 31511

3 1 44746

3 1 50534

3 1 54858

3 1 58160

3 1 60326"

d<-read.delim(textConnection(s),sep="",header=T)

mean(d[1:10,3])

sd(d[1:10,3])

From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

mean(d[11:18,3])

sd(d[11:18,3])

I can calculate it by hand, but that is not the idea...

网友答案:

You can do it by adding a column that checks whether or not the entry matches the value above, then use cumsum to find the places where the count changes. I group_by'd that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(whichGroup, Spc) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

Gives:

  whichGroup   Spc    mean        sd whichRows
       <int> <int>   <dbl>     <dbl>     <chr>
1          0     0  7887.0  1971.414       1:2
2          1    12 33435.0  5486.794       3:8
3          2     0  9797.5  5987.073      9:10
4          3     1 42330.5 16866.591     11:18

If you only want the last group, which I can't tell from your post if you do or not, you could instead use filter, like this:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  filter(whichGroup == max(whichGroup)) %>%
  summarise(Spc = Spc[1]
            , mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

Which gives:

  Spc    mean       sd whichRows
1   1 42330.5 16866.59     11:18

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(isLast = whichGroup == max(whichGroup)) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

which gives:

  isLast    mean       sd whichRows
   <lgl>   <dbl>    <dbl>     <chr>
1  FALSE 23597.9 13521.32      1:10
2   TRUE 42330.5 16866.59     11:18
网友答案:

Option 1: Using rleid from the data.table package:

d %>% 
  group_by(rlid = rleid(Spc)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize)) %>% 
  slice(n())

gives:

# A tibble: 1 × 3
   rlid mean_size  sd_size
  <int>     <dbl>    <dbl>
1     4   42330.5 16866.59

Option 2: Using rle:

startrow <- sum(head(rle(d$Spc)$lengths, -1)) + 1
d %>%
  slice(startrow:n()) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize))

gives:

  mean_size  sd_size
1   42330.5 16866.59

Option 3: If you want to calculate for two groups (last and others) you should use group_by instead of filter and create a new grouping vector (rep_vec) with rle:

rep_vec <- c(sum(head(rle(d$Spc)$lengths, -1)), tail(rle(d$Spc)$lengths, 1))

d %>%
  group_by(grp = rep(c('others','last_group'), rep_vec)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize))

which gives:

         grp mean_size  sd_size
       (chr)     (dbl)    (dbl)
1 last_group   42330.5 16866.59
2     others   23597.9 13521.32

If you want to include the rows, you can change the code to:

d %>%
  mutate(rn = row_number()) %>% 
  group_by(grp = rep(c('others','last_group'), rep_vec)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize), rows = paste0(range(rn), collapse=':'))

which gives:

         grp mean_size  sd_size  rows
       <chr>     <dbl>    <dbl> <chr>
1 last_group   42330.5 16866.59 11:18
2     others   23597.9 13521.32  1:10
网友答案:

So you want to find the index where the middle vector starts to be constant? You can take the diff() of your vector and look for the first time this is different from zero. For example,

vec <- c(1,2,3,4,5,5,5,6,6,6)
diff(vec)
differences <- rev(diff(vec))

# distance from the end of first non-zero
min.dist <- min(which(differences != 0))

# take difference
length(vec) - min.dist + 1

That last value should give you the index where the vector starts to be constant.

分享给朋友:
您可能感兴趣的文章:
随机阅读: