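I have a data frame with a column of free-text "codes" (e.g. "abcde", "blabla"). Because the data come from an online survey, users sometimes mistype the code they were supposed to enter. For example, they write "bacde" instead of "abcde".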
library(RecordLinkage)
library(tidyverse)
# Sample data
df <- data.frame(code = c(rep("abcde", 20), "bbcde", "abccde", rep("efghi", 20), "efigh", "efghj"))
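I wrote a function, correctCodes(), that returns the "corrected" code. It uses the Levenshtein distance to compute the similarity between two strings. A code is corrected only if there are at least two "reference codes" and the similarity exceeds a certain threshold (.8).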
# function that calculates similarity between two strings
mylevsim = function (str1, str2) {
  return(1 - (RecordLinkage::levenshteinDist(str1, str2)/max(nchar(str1), nchar(str2))))
}
# Function that returns the corrected code
correctCodes <- function(wrongCode) {
  wrongCode <- trimws(toupper(wrongCode))
  counts <- df %>%
    mutate(code = trimws(toupper(code))) %>%
    group_by(code) %>%
    summarise(n = n(),
              Var2 = code,
              code = NULL) %>%
    filter(!duplicated(code))
  expand.grid(trimws(toupper(df$code)), trimws(toupper(df$code)), stringsAsFactors = FALSE) %>%
    mutate(similarity = mylevsim(Var1, Var2)) %>%
    arrange(Var1, desc(similarity)) %>%
    left_join(counts, by = "Var2") %>%
    mutate(both = paste(Var1, Var2)) %>%
    filter(!duplicated(both),
           Var1 != Var2,
           Var1 == wrongCode,
           n > 2,
           similarity > .8) %>%
    filter(row_number()==1) %>%
    pull(Var2)
}
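To make the similarity measure concrete, here is a small worked sketch of the same formula, 1 - distance / max(length). It is only an illustration: lev_sim is a hypothetical name, and it swaps RecordLinkage::levenshteinDist() for base R's utils::adist() so that it runs without extra packages.

```r
# Base-R stand-in for mylevsim(): utils::adist() is used here instead of
# RecordLinkage::levenshteinDist() (an assumption made for portability)
lev_sim <- function(str1, str2) {
  d <- as.vector(utils::adist(str1, str2))  # Levenshtein edit distance
  1 - d / max(nchar(str1), nchar(str2))
}

# One extra character, longest string has 6 characters: 1 - 1/6
sim_insert <- lev_sim("ABCCDE", "ABCDE")

# One substituted character, both strings have 5 characters: 1 - 1/5
sim_subst <- lev_sim("BBCDE", "ABCDE")
```

An inserted character in a five-letter code gives roughly 0.83. A single substitution gives 1 - 1/5 = 0.8, which sits right at the .8 threshold, so whether it passes a strict > .8 comparison comes down to floating-point rounding.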
Now, if I call the function on a single value like this, it works fine (the first two calls return "ABCDE"):
correctCodes("abccde") # returns ABCDE
correctCodes("bbcde") # returns ABCDE
correctCodes("efghj") # returns EFGHI
However, if I try to use it inside dplyr::mutate(), it doesn't work at all:
df %>%
  mutate(code_corrected = correctCodes(code))
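For context on the call above, here is a minimal sketch of how mutate() hands a column to a function. It uses toy data and a hypothetical helper, n_received, neither of which is part of my real code. It only shows that the function is called once with the whole column rather than once per row, and that an element-wise wrapper such as vapply() changes that.

```r
library(dplyr)

# Toy data, not the survey data from the question
toy <- data.frame(code = c("abcde", "bbcde", "efghi"), stringsAsFactors = FALSE)

# Hypothetical helper: reports how many values it was handed
n_received <- function(x) length(x)

# Called once with the full column, so every row shows 3
whole <- toy %>%
  mutate(n_seen = n_received(code))

# Called once per element via vapply(), so every row shows 1
per_row <- toy %>%
  mutate(n_seen = vapply(code, n_received, integer(1), USE.NAMES = FALSE))
```

I assume the same kind of wrapper could be tried around correctCodes(), but I have not tested it on my data. Codes without a match would presumably come back as character(0) and would have to be turned into NA first.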