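I am working in R.
My data has 500,000 rows, but I am using a small example here.
I have data on school staff. Some people work at one school, some at two, some at three, and so on.
Schools do not always record a person's first name the same way. One school may record it as Will and another as William.
I am also making one assumption: for a person who works at more than one school, their last name and date of birth are always recorded at every school.
Based on how similar the first names are, I would like a way to identify records that are probably the same person and then assign them an ID.
Although greg and griffin share the same first two letters, they are most likely not the same person, because there is still a sizeable distance between the two names.
Sample data: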
data_current <- data.frame(
  first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"),
  last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
  date_of_birth = c("2000-01-02", "2000-01-02", "2000-01-02", "2007-01-02", "2007-01-02", "2007-01-02", "1999-01-02", "2004-01-02", "2004-01-02"),
  school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)
first_name | last_name | date_of_birth | school_id |
---|---|---|---|
will | smith | 2000-01-02 | 1 |
william | smith | 2000-01-02 | 2 |
william | smith | 2000-01-02 | 3 |
laura | maxwell | 2007-01-02 | 4 |
jessica | maxwell | 2007-01-02 | 5 |
jessicalouise | maxwell | 2007-01-02 | 6 |
james | lead | 1999-01-02 | 7 |
greg | jones | 2004-01-02 | 8 |
griffin | jones | 2004-01-02 | 9 |
Desired data:
The first three rows are very likely the same person, so they are assigned the same person_id, and so on.
data_desired <- data.frame(
  first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"),
  last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
  date_of_birth = c("2000-01-02", "2000-01-02", "2000-01-02", "2007-01-02", "2007-01-02", "2007-01-02", "1999-01-02", "2004-01-02", "2004-01-02"),
  school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  person_id = c(1, 1, 1, 2, 3, 3, 4, 5, 6)
)
first_name | last_name | date_of_birth | school_id | person_id |
---|---|---|---|---|
will | smith | 2000-01-02 | 1 | 1 |
william | smith | 2000-01-02 | 2 | 1 |
william | smith | 2000-01-02 | 3 | 1 |
laura | maxwell | 2007-01-02 | 4 | 2 |
jessica | maxwell | 2007-01-02 | 5 | 3 |
jessicalouise | maxwell | 2007-01-02 | 6 | 3 |
james | lead | 1999-01-02 | 7 | 4 |
greg | jones | 2004-01-02 | 8 | 5 |
griffin | jones | 2004-01-02 | 9 | 6 |
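
To make the goal concrete, here is a minimal sketch of one possible rule, not a definitive solution. It assumes that records are first blocked on exact `last_name` and `date_of_birth`, and that two first names count as the same person when one is a prefix of the other. The helper names `is_same_first_name` and `cluster_first_names` are hypothetical, and the prefix rule is only a stand-in for a proper similarity measure: it would miss nicknames such as "bill" for "william".

```r
# Sample data, repeated here so the sketch runs on its own.
data_current <- data.frame(
  first_name = c("will", "william", "william", "laura", "jessica",
                 "jessicalouise", "james", "greg", "griffin"),
  last_name = c("smith", "smith", "smith", "maxwell", "maxwell",
                "maxwell", "lead", "jones", "jones"),
  date_of_birth = c("2000-01-02", "2000-01-02", "2000-01-02", "2007-01-02",
                    "2007-01-02", "2007-01-02", "1999-01-02", "2004-01-02",
                    "2004-01-02"),
  school_id = 1:9,
  stringsAsFactors = FALSE
)

# Hypothetical rule (an assumption, not a tested matching method):
# two first names match when one is a prefix of the other.
is_same_first_name <- function(a, b) {
  startsWith(a, b) || startsWith(b, a)
}

# Hypothetical helper: label the first names inside one block so that
# names linked by the rule, directly or through a chain, share a label.
cluster_first_names <- function(first_names) {
  labels <- seq_along(first_names)
  for (i in seq_along(first_names)) {
    for (j in seq_len(i - 1)) {
      if (is_same_first_name(first_names[i], first_names[j])) {
        # Merge the two clusters by relabelling all of i's cluster as j's.
        labels[labels == labels[i]] <- labels[j]
      }
    }
  }
  labels
}

# Block on exact last name and date of birth.
block_key <- paste(data_current$last_name, data_current$date_of_birth, sep = "|")

# Cluster first names separately inside each block.
within_block <- ave(
  seq_len(nrow(data_current)), block_key,
  FUN = function(idx) cluster_first_names(data_current$first_name[idx])
)

# Number the (block, cluster) pairs in order of first appearance.
person_key <- paste(block_key, within_block, sep = "|")
data_current$person_id <- match(person_key, unique(person_key))

print(data_current$person_id)
```

On the sample data this gives `person_id` values 1, 1, 1, 2, 3, 3, 4, 5, 6, which matches the desired output. The pairwise loop only runs inside each block, so it should stay cheap on 500,000 rows as long as each block is small. An edit-distance measure such as base R's `adist()` with a threshold could replace the prefix test, but choosing that threshold is exactly the part I am unsure about.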
Does anyone have a suggestion for how to solve this?