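I am trying to reduce the number of records in a data frame by grouping by x and then filtering conditionally, based on the values of columns y and z within each group.
Here is what I have so far: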
# Required packages
library(tidyverse)
# Data Filtering
job_history <- data.frame(x = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
                          y = c("Hire", "Data Change", "Leave of Absence", "Hire", "Termination", "Hire", "Termination", "Rehire", "Transfer", "Hire", "Termination", "Rehire", "Termination"),
                          z = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-01-01", "2024-02-01", "2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01")))
# Group_by and conditionally filter
job_history %>%
  group_by(x) %>%
  # Count the number of hire, termination, and rehire records
  mutate(hires = sum(y == "Hire"),
         terms = sum(y == "Termination"),
         rehires = sum(y == "Rehire")) %>%
  case_when(
    # If there is 1 hire record and no terms or rehires, filter for the latest record
    (hires = 1 & terms = 0 & rehires = 0) ~ filter(z = max(z)),
    # If there is 1 hire and 1 term record, filter for the termination record
    (hires = 1 & terms = 1 & rehires = 0) ~ filter(y == "Termination"),
    # If there is 1 of each record, filter for the term record and the latest record
    (hires = 1 & terms = 1 & rehires = 1) ~ filter(y == "Termination") | z = max(z)),
    # If there is 1 hire, 2 term, and 1 rehire record, filter for both termination records
    (hires = 1 & terms = 2 & rehires = 1) ~ filter(y == "Termination")
  )
The desired output is as follows:
x  y                 z
1  Leave of Absence  2024-03-01
2  Termination       2024-02-01
3  Termination       2024-02-01
3  Transfer          2024-04-01
4  Termination       2024-02-01
4  Termination       2024-04-01
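
For reference, here is a minimal sketch of one direction that appears to produce this output. It rests on a few assumptions: `case_when()` returns a vector, so its branches cannot call `filter()`; comparisons need `==` rather than `=`; and any group that matches none of the four patterns is dropped (the `TRUE ~ FALSE` fallback is a guess at the intended behaviour). The names `keep` and `result` are made up for illustration, and only `dplyr` is loaded rather than the whole tidyverse.

```r
library(dplyr)

job_history <- data.frame(
  x = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  y = c("Hire", "Data Change", "Leave of Absence", "Hire", "Termination",
        "Hire", "Termination", "Rehire", "Transfer",
        "Hire", "Termination", "Rehire", "Termination"),
  z = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01", "2024-01-01",
                "2024-02-01", "2024-01-01", "2024-02-01", "2024-03-01",
                "2024-04-01", "2024-01-01", "2024-02-01", "2024-03-01",
                "2024-04-01"))
)

result <- job_history %>%
  group_by(x) %>%
  mutate(
    # Per-group counts of each record type
    hires   = sum(y == "Hire"),
    terms   = sum(y == "Termination"),
    rehires = sum(y == "Rehire"),
    # Row-level flag (hypothetical helper column): TRUE for rows to keep
    keep = case_when(
      # 1 hire only: keep the latest record
      hires == 1 & terms == 0 & rehires == 0 ~ z == max(z),
      # 1 hire, 1 term: keep the termination record
      hires == 1 & terms == 1 & rehires == 0 ~ y == "Termination",
      # 1 of each: keep the termination record and the latest record
      hires == 1 & terms == 1 & rehires == 1 ~ y == "Termination" | z == max(z),
      # 1 hire, 2 terms, 1 rehire: keep both termination records
      hires == 1 & terms == 2 & rehires == 1 ~ y == "Termination",
      # Assumption: groups matching no pattern are dropped entirely
      TRUE ~ FALSE
    )
  ) %>%
  filter(keep) %>%
  ungroup() %>%
  select(x, y, z)

print(result)
```

The idea is to let `case_when()` decide, row by row, whether a record survives, and then apply a single `filter()` on that flag, instead of trying to choose a different `filter()` call per group.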