Counting new unique IDs with group_by and n_distinct

Posted on August 23

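The following code counts the number of unique IDs in each year. My question is: how do I count the new unique IDs, i.e. IDs that did not appear in any previous year?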

df %>%
  group_by(Year) %>% 
  summarize(count = n_distinct(ID))

For example, I need to create the variable wanted_count shown below:

Year	ID	count	wanted_count
2000	1	3	3
2000	2	3	3
2000	3	3	3
2001	2	2	0
2001	3	2	0
2002	3	2	1
2002	4	2	1
2003	4	2	1
2003	7	2	1
2003	4	2	1

See the data below:

structure(list(Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 
2002L, 2003L, 2003L, 2003L), ID = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 
4L, 7L, 4L), count = c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), wanted_count = c(3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-10L))

Recommended answer

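library(dplyr)
# Assumes the rows of df are already ordered by Year
df %>%
  # running count of distinct IDs seen so far, in row order
  mutate(cum_new = cumsum(!duplicated(ID))) %>%
  group_by(Year) %>% 
  # distinct IDs seen by the end of each year
  summarize(total = max(cum_new), .groups = "drop") %>%
  mutate(
    # year-over-year increase = IDs first seen in that year
    result = c(first(total), diff(total)),
    total = NULL
  ) %>%
  # attach the per-year result back onto the original rows
  left_join(df, by = "Year")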
# # A tibble: 10 × 5
#     Year result    ID count wanted_count
#    <int>  <int> <int> <int>        <int>
#  1  2000      3     1     3            3
#  2  2000      3     2     3            3
#  3  2000      3     3     3            3
#  4  2001      0     2     2            0
#  5  2001      0     3     2            0
#  6  2002      1     3     2            1
#  7  2002      1     4     2            1
#  8  2003      1     4     2            1
#  9  2003      1     7     2            1
# 10  2003      1     4     2            1
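
To see why the pipeline above gives the wanted counts, the intermediate values can be inspected one step at a time. The following is only a base R sketch of the same logic, not part of the original answer; the names is_new, total and new_per_year are made up for illustration, and the rows are assumed to be ordered by Year.

```r
# Same Year/ID values as the sample data in the question
Year <- c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L)
ID   <- c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)

# TRUE the first time an ID shows up in row order
is_new <- !duplicated(ID)
# TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE

# Running count of distinct IDs seen so far
cum_new <- cumsum(is_new)
# 1 2 3 3 3 3 4 4 5 5

# Distinct IDs seen by the end of each year
total <- vapply(split(cum_new, Year), max, integer(1))
# 2000: 3, 2001: 3, 2002: 4, 2003: 5

# The increase from one year to the next is the number of IDs first seen that year
new_per_year <- c(total[1], diff(total))
# 2000: 3, 2001: 0, 2002: 1, 2003: 1

# Map the per-year value back onto every row
wanted <- unname(new_per_year[as.character(Year)])
```

The last step plays the role of the left_join in the answer: each row receives the value computed for its year.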

Using this data:

df = read.table(text = 'Year    ID  count   wanted_count
2000    1   3   3
2000    2   3   3
2000    3   3   3
2001    2   2   0
2001    3   2   0
2002    3   2   1
2002    4   2   1
2003    4   2   1
2003    7   2   1
2003    4   2   1', header = TRUE)
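
As a possible alternative, the same numbers can be obtained without the cumsum/diff step by flagging each ID's first appearance and summing the flags within each year. This is a minimal sketch rather than the answer's method; is_new and new_count are hypothetical column names, and arrange(Year) is included because duplicated() depends on row order.

```r
library(dplyr)

# Year and ID columns from the sample data in the question
df <- data.frame(
  Year = c(2000L, 2000L, 2000L, 2001L, 2001L, 2002L, 2002L, 2003L, 2003L, 2003L),
  ID   = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 4L, 7L, 4L)
)

out <- df %>%
  arrange(Year) %>%                     # earlier years must come first
  mutate(is_new = !duplicated(ID)) %>%  # TRUE on an ID's first appearance overall
  group_by(Year) %>%
  mutate(new_count = sum(is_new)) %>%   # number of new IDs in this year, repeated per row
  ungroup() %>%
  select(-is_new)

out$new_count
# 3 3 3 0 0 1 1 1 1 1
```

Because the flag is computed before grouping, an ID that repeats within a single year (such as ID 4 in 2003) is still counted at most once.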