In my reprex below:

  1. RSA is the output of a process that is to be analyzed and whose results is to be grouped.
  2. Each RSA group has varying range of days (datenum) each is observed.
  3. var1 varies less frequently, but each is observed for the same 8 days consecutively.
  4. The RSA groups are to be numbered sequentially within the var1 group; when a new var1 is encountered the RSA group numbering begins anew.
  5. idx_objective is the index that I am looking for.

Reprex:

var1 <- c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd")
RSA <- c(1,1,1,0,-1,-1,0,-1, 
         0,0,0,-1,-1,-1,1,1,
        -1,-1,0,1,1,-1,-1,1, 
        1,-1,-1,1,1,0,-1,1)
idx_objective <- c(1,1,1,2,3,3,4,5, 
                   1,1,1,2,2,2,3,3, 
                   1,1,2,3,3,4,4,5, 
                   1,2,2,3,3,4,5,6)
objective.df <- data.frame(var1, RSA, idx_objective) %>%
  group_by(var1) %>%
  mutate  (datenum = 1:n()) %>%
  relocate (datenum, .after = var1)

I have reviewed many SO posts that appear to be similar...

1dplyr: group variables then assign unique names based on unique grouping

revolves around correct use of cumsum, which I think I am using correctly

[https://stackoverflow.com/questions/40519129/how-to-assign-unique-id-for-group-of-duplicates]

[2]How to divide between groups of rows using dplyr

The last two don't seem applicable; two others referenced in the following:

Approach #1: using a change flag and cumsum

objective.try1 <- objective.df %>%
  group_by(var1) %>%
  mutate(chg_flg = ifelse(lag(RSA) != RSA, 1, 0) %>%
    coalesce(0)) %>%
  relocate(chg_flg, .after = RSA) %>%
  relocate (datenum, .after var1) %>%
  group_by(var1, chg_flg) %>%
  mutate (idx_objective_try = cumsum(chg_flg) +1) %>%

Results:

objective.try1 <- c(1, 1, 1, 2, 3, 1, 4, 5, 1, 1, 1, 2, 1, 1, 3, 1, 1, 2, 3, 1, 4, 1, 5, 1, 2, 1, 3, 1, 4, 5, 6)
objective.df <- data.frame(var1, RSA, idx_objective, objective.try1 %>%
  group_by(var1) %>%
  mutate (datenum = 1: n()) %>%
  relocate(datenum, .after = var1)

Observation for objective.try1: rows 1-5 work, but row 6 incorrectly restarts the idx numbering over again, but then resumes correctly reflecting the chg_flg until rows 13 and 14 at which time the idx numbering is again incorrectly restarted, but then resumes again being correct for one row until being incorrect again at rows 16, 21, 23, 27, and 29.

Following the logic at row 6, for example -- the previous idx_objective_try (row 5) is 3 and the chg_flg value at row 6 is zero, so the idx_objecitve_try ought to be the correct value of 3. Why isn't it?

Approach #2: Using match and duplicated:

objective.try2 <- objective.df %>%
  group_by(var1) %>%. # var1 corresponds to "prop" in the SO post (both the slower moving variables)
  mutate(well_rep1 = match(RSA, unique(RSA)), # "RSA" corresponds to "well" in the SO post (both the faster changing variables)
  well_rep2 = cumsum(!duplicated(RSA))) # approach similar to above

Observation for objective.try2: most rows work, but there again are rows that do not work, though the rows that don't work are different from those in the first try.

I would appreciate it if someone would point out what I am doing wrong.

推荐答案

Focusing on the first group in var1 which is "aaa", you have the following column data for RSA:

objective.df[objective.df$var1 == "aaa", "RSA"]
    RSA
  <dbl>
1     1
2     1
3     1
4     0
5    -1
6    -1
7     0
8    -1

If you use the diff function on this to get differences between one value and the next, you will get 7 differences for the 8 data elements in RSA for "aaa". In other words, if you have a length "n" vector you will get back "n-1" values in the end (without changing arguments).

Since we want to add a new column new_id_objective, we will need to include 8 values for this group, not 7. The combine c will help create the vector for the new column, combining an initial number with the 7 differences from diff, for a total of 8.

So for "aaa", the diff will return 0, 0, -1, -1, 0, 1, -1. These values are compared with 0 (!= 0), with the result of the evaluation being either TRUE or FALSE. The cumsum of TRUE will increment the counter. In other words, the value of new_id_objective will increase row-by-row when there is a difference between consecutive row values (up or down, or anything but zero).

In this example, we used combine c with an initial value of 1. Note that we can use any number that is not zero to start with (while in this case we used 1, we could have used 5, 100, -10, or anything not zero). By doing this, the first evaluation will be TRUE (not equal to zero) and the new_id_objective counter value will start with 0 + 1 = 1. Otherwise, if you did try to use zero, i.e., c(0, diff(RSA)) != 0, this will evaluate to FALSE, and the cumulative sum would start with 0, not 1.

Let me know if this helps.

library(tidyverse)

objective.df %>% 
  group_by(var1) %>% 
  mutate(new_id_objective = cumsum(c(1, diff(RSA)) != 0))

Output

   var1  datenum   RSA idx_objective new_id_objective
   <chr>   <int> <dbl>         <dbl>            <int>
 1 aaa         1     1             1                1
 2 aaa         2     1             1                1
 3 aaa         3     1             1                1
 4 aaa         4     0             2                2
 5 aaa         5    -1             3                3
 6 aaa         6    -1             3                3
 7 aaa         7     0             4                4
 8 aaa         8    -1             5                5
 9 bbb         1     0             1                1
10 bbb         2     0             1                1
# … with 22 more rows

R相关问答推荐

将 11 GB .csv 文件加载为 big.matrix 对象

自举生物等效性结果 R

使用列索引而不是列名在 R dplyr 中进行变异

R创建一个数据框,总结来自另一个数据框的每列唯一值并排序

计算 R 中单独数据框中某个日期范围内的条目数

在 R 中使用 dplyr 包 Lag 函数时,有没有办法省略 NA?

根据 R 中另一列中下面的行 Select 最后一个值

用 dplyr 总结和统计分组 df 中唯一值的数量

计算每个 ID 的唯一值组合数

map_dfr 输出一行而不是一列

您可以在水平条之间放置标签吗?

在具有数值和字符的向量上使用“大于”/“小于”

如何在情节中自定义悬停窗口?

重新排序 tibble 中的一行 - 将其移动到最后一行

R chr 数据类型为带 AM 的日期和时间

R从数据集中的定制子集中获取分位数和均值

统计模式的空数据变化功能

提取列包含 NA 值的单元格并根据结果创建一个新列

在R中从宽到长reshape 数据框

为值的子集创建运行长度 ID