I am grouping some data, for example entity data. I have already found groups based on some entity attributes, as shown below:
df <- data.frame(uniq_index.x = c(1426, 1426, 1426, 1426, 7796, 7796, 7796, 7796,
7159, 7159, 7159, 7159, 7857, 7857, 7857, 7857,
7158, 7158, 7158, 7158, 5440, 9861, 1641, 8685,
1644, 7525, 6030, 5672),
uniq_index.y = c(7796, 7159, 7857, 7158, 1426, 7159, 7857, 7158,
1426, 7796, 7857, 7158, 1426, 7796, 7159, 7158,
1426, 7796, 7159, 7857, 9861, 5440, 8685, 1641,
7525, 1644, 5673, 6030)
)
# grouping: one tibble per distinct uniq_index.x
library(dplyr)
a <- df %>%
  group_by(uniq_index.x) %>%
  group_split()
From the data above, 1426, 7796, 7159, 7857, and 7158 should be in one group; 5672, 5673, and 6030 should be in another. I can use group_by and group_split to get the groups.
However, this produces duplicate groups, so I use the following code to get the unique groups:
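# initialize an empty data frame
b <- data.frame(V1 = character())
# loop through a (which is obtained from group_split)
for (i in seq_along(a)) {
  x <- a[[i]][, 1]
  y <- a[[i]][, 2]
  # rename both columns to uniq_index so they can be stacked
  x <- x %>%
    mutate(uniq_index = uniq_index.x) %>%
    select(uniq_index)
  y <- y %>%
    mutate(uniq_index = uniq_index.y) %>%
    select(uniq_index)
  # members of this group: the key plus everything it is linked to, sorted
  z <- unique(x) %>%
    rbind(y) %>%
    arrange(uniq_index)
  # store the sorted member list as a single string so duplicates can be detected
  b <- b %>%
    rbind(paste(z))
}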
# unique groups, each given a sequential id
b <- b %>%
  unique() %>%
  mutate(
    uniq_agency_id = 100000 + row_number()
  )
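Then I noticed a problem:
As in the sample data, (6030, 5672) and (5673, 6030) come out as two separate groups. These two groups should be merged into one larger group, because they share 6030.
I am struggling to come up with logic that merges overlapping groups so that the final groups are unique.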
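
One way to frame this is as a connected-components problem: treat every row as an edge between uniq_index.x and uniq_index.y, and put all ids that can reach each other in the same group. Below is a minimal base R sketch of that idea using a simple union-find. The helper name find_groups and the output column layout are assumptions made for illustration, not something from the code above; the igraph package's components() function is a commonly used alternative if adding a dependency is acceptable.

```r
# Sample pairs from the question: each row links two ids.
df <- data.frame(
  uniq_index.x = c(1426, 1426, 1426, 1426, 7796, 7796, 7796, 7796,
                   7159, 7159, 7159, 7159, 7857, 7857, 7857, 7857,
                   7158, 7158, 7158, 7158, 5440, 9861, 1641, 8685,
                   1644, 7525, 6030, 5672),
  uniq_index.y = c(7796, 7159, 7857, 7158, 1426, 7159, 7857, 7158,
                   1426, 7796, 7857, 7158, 1426, 7796, 7159, 7158,
                   1426, 7796, 7159, 7857, 9861, 5440, 8685, 1641,
                   7525, 1644, 5673, 6030)
)

# Hypothetical helper (not from the original post): labels connected
# components of the pair list with a basic union-find.
find_groups <- function(from, to) {
  ids <- sort(unique(c(from, to)))
  parent <- seq_along(ids)          # every id starts as its own root

  # Follow parent links until reaching a root (a node that is its own parent).
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  fi <- match(from, ids)
  ti <- match(to, ids)
  for (k in seq_along(fi)) {
    ra <- find_root(fi[k])
    rb <- find_root(ti[k])
    # Merge the two components by pointing the larger root at the smaller one.
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(seq_along(ids), find_root, integer(1))
  data.frame(
    uniq_index     = ids,
    uniq_agency_id = 100000L + match(roots, unique(roots))
  )
}

groups <- find_groups(df$uniq_index.x, df$uniq_index.y)
print(groups)
```

With this approach 5672, 5673, and 6030 receive the same uniq_agency_id because they are linked through 6030, and the direction or duplication of a pair does not matter. The loop over rows is fine for small inputs; for large pair lists a graph library would likely be faster.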