比另一个答案晚了一分钟,但用bind_rows
:
library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
firstname = c("", "Peter", "David", "Jessy", ""),
lastname = c("", "White", "Smith", "Connor", ""),
mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA))
df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, NA , NA),
firstname = c("", "", "", "David", ""),
lastname = c("", "", "", "Smith", ""))
bind_rows(
df.new |>
filter(mail != "") |>
left_join(df.old |> select(mail, PID)),
df.new |>
filter(is.na(mail)) |>
left_join(
df.old |> filter(is.na(mail)) |> select(firstname, lastname, PID),
by = join_by(firstname, lastname)
)
)
#> Joining with `by = join_by(mail)`
#> mail firstname lastname PID
#> 1 user1@mail.com 1
#> 2 user2@mail.com 2
#> 3 <NA> NA
#> 4 <NA> David Smith 3
#> 5 <NA> NA
编辑-跳过错误邮箱
对邮箱进行更强大判断的最简单方法是使用正则表达式来测试第一个匹配的邮箱有效性,并在第二个匹配中否定它(并检测NA),然后加入结果:
library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, 5, 6),
firstname = c("", "Peter", "David", "Jessy", "Bad", "Gordon"),
lastname = c("", "White", "Smith", "Connor", "Email", "Bennet"),
mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA, "oldemail@mail.com"))
df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, "" , "bademail@none", "newemail@mail.com"),
firstname = c("", "", "", "David", "Bad", "Gordon"),
lastname = c("", "", "", "Smith", "Email", "Bennet"))
# Step 1: all valid, present, emails with matching records:
first_join <- df.new |>
filter(str_detect(
mail,
"^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"
)) |>
inner_join(df.old |> select(mail, PID), by = join_by(mail))
# Step 2: all remaining records, with invalid emails, missing emails or
# non-matched emails; join to first set
df.new |>
anti_join(first_join, by = join_by(mail)) |>
left_join(df.old |>
anti_join(first_join, by = join_by(mail)) |>
select(firstname, lastname, PID),
by = join_by(firstname, lastname)) |>
bind_rows(first_join, second = _)
#> mail firstname lastname PID
#> 1 user1@mail.com 1
#> 2 user2@mail.com 2
#> 3 <NA> NA
#> 4 David Smith 3
#> 5 bademail@none Bad Email 5
#> 6 newemail@mail.com Gordon Bennet 6
这里的坏先生邮箱不能通过邮箱匹配(因为它不是有效的邮箱地址),所以它被跳过,他的名字匹配.戈登·班纳特先生换了邮箱,所以他的新邮箱找不到,只能通过名字找到.最后,David有一个空字符串email(""
),所以我们跳过了他,并按名称进行匹配.没有姓名、邮箱或PID的不可见人员将按照指定保存在新的网络框架中.