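I have two data frames, df.old and df.new. I want to add the id variable (PID) from df.old to the rows of the new data frame where the e-mail address matches. If no e-mail address is available, the match should fall back on first and last name (paste(firstname, lastname)). How can I do this efficiently?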

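My first guess was to create two "lookup" functions, get_PID_by_mail(mail) and get_PID_by_name(firstname, lastname), vectorize them, and apply them in df.new %>% mutate(PID=get_PID_by_mail(mail)). But this turned out to be rather inefficient, because the data frames are large. How would you solve this? Thanks!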

df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
                     firstname = c("", "Peter", "David", "Jessy", ""),
                     lastname = c("", "White", "Smith", "Connor", ""),
                     mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA))

df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, NA , NA),
                     firstname = c("", "", "", "David", ""),
                     lastname = c("", "", "", "Smith", ""))
df.new

Expected output:

df.new
======
   mail                firstname lastname  PID
1  user1@mail.com                          1
2  user2@mail.com                          2
3  <NA>                                    <NA>
4  <NA>                David     Smith     3
5  <NA>                                    <NA>

Recommended answer

A minute later than the other answer, but using bind_rows:

library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
                     firstname = c("", "Peter", "David", "Jessy", ""),
                     lastname = c("", "White", "Smith", "Connor", ""),
                     mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA))

df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, NA , NA),
                     firstname = c("", "", "", "David", ""),
                     lastname = c("", "", "", "Smith", ""))

bind_rows(
  df.new |>
    filter(mail != "") |>
    left_join(df.old |> select(mail, PID)),
  
  df.new |>
    filter(is.na(mail)) |>
    left_join(
      df.old |> filter(is.na(mail)) |> select(firstname, lastname, PID),
      by = join_by(firstname, lastname)
    )
)
#> Joining with `by = join_by(mail)`
#>             mail firstname lastname PID
#> 1 user1@mail.com                      1
#> 2 user2@mail.com                      2
#> 3           <NA>                     NA
#> 4           <NA>     David    Smith   3
#> 5           <NA>                     NA
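
If the original row order of df.new matters, the same two-stage lookup can also be sketched in base R with match(), which avoids splitting and re-binding the data frame. This is only an illustration, not the approach used above. It assumes that NA and empty strings should never count as a match (hence the incomparables argument), and, unlike the bind_rows version, it falls back on the name whenever the e-mail lookup finds nothing, not only when mail is NA. pid_by_mail and pid_by_name are names made up for this sketch.

```r
df.old <- data.frame(PID = c(1, 2, 3, 4, NA),
                     firstname = c("", "Peter", "David", "Jessy", ""),
                     lastname = c("", "White", "Smith", "Connor", ""),
                     mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA))

df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, NA, NA),
                     firstname = c("", "", "", "David", ""),
                     lastname = c("", "", "", "Smith", ""))

# Lookup 1: position of each new e-mail in df.old$mail.
# NA and "" are declared incomparable so they never match anything.
pid_by_mail <- df.old$PID[match(df.new$mail, df.old$mail,
                                incomparables = c(NA, ""))]

# Lookup 2: the same idea on the pasted name key.
# An empty first and last name pastes to " ", which is excluded as well.
pid_by_name <- df.old$PID[match(paste(df.new$firstname, df.new$lastname),
                                paste(df.old$firstname, df.old$lastname),
                                incomparables = " ")]

# Prefer the e-mail match and fall back on the name match.
df.new$PID <- ifelse(is.na(pid_by_mail), pid_by_name, pid_by_mail)
df.new
```

For the example data this yields PID values 1, 2, NA, 3, NA, in the same row order as df.new. Both lookups are single vectorized calls, so no per-row function is applied.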

Edit: skipping invalid e-mails

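The simplest way to make the e-mail check more robust is to use a regular expression to test e-mail validity for the first match, exclude those matched records from the second match (which then also covers NA), and then combine the results: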

library(tidyverse)
df.old <- data.frame(PID = c(1, 2, 3, 4, 5, 6),
                     firstname = c("", "Peter", "David", "Jessy", "Bad", "Gordon"),
                     lastname = c("", "White", "Smith", "Connor", "Email", "Bennet"),
                     mail = c("user1@mail.com", "user2@mail.com", NA, "user10@mail.com", NA, "oldemail@mail.com"))

df.new <- data.frame(mail = c("user1@mail.com", "user2@mail.com", NA, "" , "bademail@none", "newemail@mail.com"),
                     firstname = c("", "", "", "David", "Bad", "Gordon"),
                     lastname = c("", "", "", "Smith", "Email", "Bennet"))

# Step 1: all valid, present, emails with matching records:
first_join <- df.new |>
  filter(str_detect(
    mail,
    "^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"
  )) |>
  inner_join(df.old |> select(mail, PID), by = join_by(mail))

# Step 2: all remaining records, with invalid emails, missing emails or
# non-matched emails; join to first set
df.new |>
  anti_join(first_join, by = join_by(mail)) |>
  left_join(df.old |> 
              anti_join(first_join, by = join_by(mail)) |> 
              select(firstname, lastname, PID),
            by = join_by(firstname, lastname)) |> 
  bind_rows(first_join, second = _)
#>                mail firstname lastname PID
#> 1    user1@mail.com                      1
#> 2    user2@mail.com                      2
#> 3              <NA>                     NA
#> 4                       David    Smith   3
#> 5     bademail@none       Bad    Email   5
#> 6 newemail@mail.com    Gordon   Bennet   6

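Here Mr Bad Email cannot be matched by e-mail (because it is not a valid e-mail address), so the e-mail is skipped and he is matched by name. Mr Gordon Bennet has changed his e-mail, so his new address is not found and he can only be matched by name. Finally, David has an empty-string e-mail (""), so we skip it and match him by name. The invisible person with no name, e-mail or PID is kept in the new data frame, as specified.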
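
To see which addresses the pattern accepts before running the joins, it can be tried on a few values on its own. A minimal sketch in base R, reusing the regular expression from the code above; is_valid_mail is a hypothetical helper name, and grepl() is used here instead of str_detect(), so an NA input comes back as FALSE rather than NA.

```r
# Same pattern as in the answer: local part, "@", domain with at least one dot.
mail_pattern <- "^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"

# Hypothetical helper: TRUE only for strings that look like an e-mail address.
is_valid_mail <- function(mail) grepl(mail_pattern, mail, perl = TRUE)

candidates <- c("user1@mail.com", "bademail@none", "", NA, "newemail@mail.com")
valid <- is_valid_mail(candidates)

data.frame(mail = candidates, valid = valid)
```

Only "user1@mail.com" and "newemail@mail.com" pass. The check concerns the shape of the address only, so a well-formed address that is absent from df.old (such as Gordon's new one) still has to fall through to the name match.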
