R: How to determine the difference in days between two dates across two columns and two rows, by group

Posted on November 17

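I want to determine the difference in days across two columns and two rows, by group. Basically, subtract each End_Date from the Start_Date in the following row, record the difference as a new column in the data frame, and start over whenever a new group (ID) begins.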

Start_Date   End_Date     ID   
  
2014-05-09   2015-05-08   01
2015-05-09   2016-05-08   01 
2016-05-11   2017-05-10   01
2017-05-11   2018-05-10   01
2016-08-29   2017-08-28   02
2017-08-29   2018-08-28   02

The result should look like the table below.

Start_Date   End_Date     ID   Days_Difference 
  
2014-05-09   2015-05-08   01         NA
2015-05-09   2016-05-08   01         01
2016-05-11   2017-05-10   01         03
2017-05-11   2018-05-10   01         01
2016-08-29   2017-08-28   02         NA
2017-08-29   2018-08-28   02         01

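Essentially, within each group (ID), I want the difference between an End_Date and the Start_Date diagonally below it to the left. I've been struggling with this, and I don't think my code would be helpful. Any solution using tidyverse, data.table, or base R would be much appreciated!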

Recommended answer

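After grouping by ID, we can take the difference between the lead (next element) of 'Start_Date' and 'End_Date', then shift the result down one row with lag so that each gap lands on the later of the two rows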

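library(dplyr)
df1 <- df1 %>%
   # convert both date columns from character to Date
   mutate(across(ends_with("Date"), as.Date)) %>%
   group_by(ID) %>% 
   # next Start_Date minus current End_Date, shifted down one row within the group
   mutate(Days_Difference = as.numeric(lag(lead(Start_Date) - End_Date))) %>% 
   ungroup()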

-output

df1
# A tibble: 6 × 4
  Start_Date End_Date      ID Days_Difference
  <date>     <date>     <int>           <dbl>
1 2014-05-09 2015-05-08     1              NA
2 2015-05-09 2016-05-08     1               1
3 2016-05-11 2017-05-10     1               3
4 2017-05-11 2018-05-10     1               1
5 2016-08-29 2017-08-28     2              NA
6 2017-08-29 2018-08-28     2               1
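
The nested lag(lead(...)) call computes each gap on the earlier row and then shifts it down one position. As a possibly easier-to-read sketch, assuming the rows are already ordered by Start_Date within each ID, the previous row's End_Date can be subtracted from the current Start_Date directly. The name `res` is only illustrative and is not part of the original answer.

```r
library(dplyr)

# Same sample data as in the question (dates stored as character)
df1 <- data.frame(
  Start_Date = c("2014-05-09", "2015-05-09", "2016-05-11",
                 "2017-05-11", "2016-08-29", "2017-08-29"),
  End_Date   = c("2015-05-08", "2016-05-08", "2017-05-10",
                 "2018-05-10", "2017-08-28", "2018-08-28"),
  ID         = c(1L, 1L, 1L, 1L, 2L, 2L)
)

res <- df1 %>%
  mutate(across(ends_with("Date"), as.Date)) %>%
  group_by(ID) %>%
  # current start minus the previous row's end; the first row of each group is NA
  mutate(Days_Difference = as.numeric(Start_Date - lag(End_Date))) %>%
  ungroup()

res$Days_Difference
```

Both forms should give the same column for this data; the single lag() version just avoids the round trip through lead().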

Or the same logic with data.table

library(data.table)
setDT(df1)[, Days_Difference := 
    as.numeric(shift(shift(as.IDate(Start_Date), type = "lead") - 
       as.IDate(End_Date))), ID]

-output

> df1
   Start_Date   End_Date    ID Days_Difference
       <char>     <char> <int>           <num>
1: 2014-05-09 2015-05-08     1              NA
2: 2015-05-09 2016-05-08     1               1
3: 2016-05-11 2017-05-10     1               3
4: 2017-05-11 2018-05-10     1               1
5: 2016-08-29 2017-08-28     2              NA
6: 2017-08-29 2018-08-28     2               1
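
The question also asked about base R, which the answer above does not cover. A minimal sketch, assuming the rows are already sorted by Start_Date within each ID, uses ave() to fetch the previous End_Date per group. The helper name `prev_end` is hypothetical and only used for illustration.

```r
# Same sample data as in the question (dates stored as character)
df1 <- data.frame(
  Start_Date = c("2014-05-09", "2015-05-09", "2016-05-11",
                 "2017-05-11", "2016-08-29", "2017-08-29"),
  End_Date   = c("2015-05-08", "2016-05-08", "2017-05-10",
                 "2018-05-10", "2017-08-28", "2018-08-28"),
  ID         = c(1L, 1L, 1L, 1L, 2L, 2L)
)

# Convert the character columns to Date
df1$Start_Date <- as.Date(df1$Start_Date)
df1$End_Date   <- as.Date(df1$End_Date)

# Previous End_Date within each ID, as days since 1970-01-01;
# the first row of every group has no predecessor and gets NA
prev_end <- ave(as.numeric(df1$End_Date), df1$ID,
                FUN = function(x) c(NA, head(x, -1)))

# Current Start_Date minus the previous End_Date, in days
df1$Days_Difference <- as.numeric(df1$Start_Date) - prev_end

df1
```

Working on the numeric day counts keeps ave() away from Date class handling; the subtraction is still in whole days because Date values are stored as days since the epoch.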

data

df1 <- structure(list(Start_Date = c("2014-05-09", "2015-05-09", "2016-05-11", 
"2017-05-11", "2016-08-29", "2017-08-29"), End_Date = c("2015-05-08", 
"2016-05-08", "2017-05-10", "2018-05-10", "2017-08-28", "2018-08-28"
), ID = c(1L, 1L, 1L, 1L, 2L, 2L)), class = "data.frame", 
row.names = c(NA, 
-6L))