问题描述:

I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

`s<-"no Spc PSize`

2 0 6493

2 0 9281

2 12 26183

2 12 36180

2 12 37806

2 12 37765

3 12 36015

3 12 26661

3 0 14031

3 0 5564

3 1 17701

3 1 20808

3 1 31511

3 1 44746

3 1 50534

3 1 54858

3 1 58160

3 1 60326"

d<-read.delim(textConnection(s),sep="",header=T)

mean(d[1:10,3])

sd(d[1:10,3])

From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

`mean(d[11:18,3])`

sd(d[11:18,3])

I can calculate it by hand, but that is not the idea...

You can do it by adding a column that checks whether or not the entry matches the value above, then use `cumsum`

to find the places where the count changes. I `group_by`

'd that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(whichGroup, Spc) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

Gives:

```
whichGroup Spc mean sd whichRows
<int> <int> <dbl> <dbl> <chr>
1 0 0 7887.0 1971.414 1:2
2 1 12 33435.0 5486.794 3:8
3 2 0 9797.5 5987.073 9:10
4 3 1 42330.5 16866.591 11:18
```

If you only want the last group, which I can't tell from your post if you do or not, you could instead use `filter`

, like this:

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
filter(whichGroup == max(whichGroup)) %>%
summarise(Spc = Spc[1]
, mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

Which gives:

```
Spc mean sd whichRows
1 1 42330.5 16866.59 11:18
```

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(isLast = whichGroup == max(whichGroup)) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

which gives:

```
isLast mean sd whichRows
<lgl> <dbl> <dbl> <chr>
1 FALSE 23597.9 13521.32 1:10
2 TRUE 42330.5 16866.59 11:18
```

**Option 1:** Using `rleid`

from the `data.table`

package:

```
d %>%
group_by(rlid = rleid(Spc)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize)) %>%
slice(n())
```

gives:

```
# A tibble: 1 × 3
rlid mean_size sd_size
<int> <dbl> <dbl>
1 4 42330.5 16866.59
```

**Option 2:** Using `rle`

:

```
startrow <- sum(head(rle(d$Spc)$lengths, -1)) + 1
d %>%
slice(startrow:n()) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
```

gives:

```
mean_size sd_size
1 42330.5 16866.59
```

**Option 3:** If you want to calculate for two groups (last and others) you should use `group_by`

instead of `filter`

and create a new grouping vector (`rep_vec`

) with `rle`

:

```
rep_vec <- c(sum(head(rle(d$Spc)$lengths, -1)), tail(rle(d$Spc)$lengths, 1))
d %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
```

which gives:

```
grp mean_size sd_size
(chr) (dbl) (dbl)
1 last_group 42330.5 16866.59
2 others 23597.9 13521.32
```

If you want to include the rows, you can change the code to:

```
d %>%
mutate(rn = row_number()) %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize), rows = paste0(range(rn), collapse=':'))
```

which gives:

```
grp mean_size sd_size rows
<chr> <dbl> <dbl> <chr>
1 last_group 42330.5 16866.59 11:18
2 others 23597.9 13521.32 1:10
```

So you want to find the index where the middle vector starts to be constant? You can take the `diff()`

of your vector and look for the first time this is different from zero. For example,

```
vec <- c(1,2,3,4,5,5,5,6,6,6)
diff(vec)
differences <- rev(diff(vec))
# distance from the end of first non-zero
min.dist <- min(which(differences != 0))
# take difference
length(vec) - min.dist + 1
```

That last value should give you the index where the vector starts to be constant.

您可能感兴趣的文章：

- Training a fully convolutional neural network with inputs of variable size takes unreasonably long time in Keras/TensorFlow
- c - What is the difference between commented else() part and not commented else() in below code
- How to execute Rust code directly on Unix systems? (using the shebang)
- How can I use this Github library in android studio
- Delphi TPngImage memory leak
- php - Reading a specific line from a text file
- clr - Any implementation of an Unrolled Linked List in C#?
- Finding Hudson Log Files
- Forward to a payment-gateway together with POST data using cURL (or any other PHP server side solution)
- WCF in Winforms app - is it always single-threaded?

随机阅读：

**推荐内容**-

**热点内容**