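Identifying missing unique values across multiple columns in R

Posted on March 22

I have a dataset with three columns that should, in theory, contain the same number of unique values.

Here is a sample of the data: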

   speciesID                          common_name                       species
        s001                        common lizard              Zootoca vivipara
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s002                     social tuco-tuco           Ctenomys sociabilis
        s003                           red grouse      Lagopus lagopus scoticus
        s003                           red grouse      Lagopus lagopus scoticus
        s004                                  elk                Cervus elaphus

The full dataset can be found here.

However, when I count the unique values in each column, the numbers don't match.

df %>% as_tibble() %>% count(speciesID) %>% nrow()   # 148 unique values
df %>% as_tibble() %>% count(common_name) %>% nrow() # 150 unique values
df %>% as_tibble() %>% count(species) %>% nrow()     # 147 unique values

Is there a way to figure out where the 2 missing unique values in the speciesID column and the 3 missing unique values in the species column come from?

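Ideally, I'd like to identify the problem rows so that I can go back to the raw data and fix the errors (i.e., there should be 150 unique records).

I'm hoping there is a way to do this in R rather than manually checking roughly 700 rows of data.

I tried using anti_join, but without success.

I'm working in R and am most comfortable with dplyr.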

Recommended answer

library(dplyr)

# Read the data, dropping the first column
aa <- readr::read_csv("clean_data_species.csv")[, -1]

# speciesID values that appear in more than one distinct row
distinct(aa) |> filter(.by = speciesID, n() > 1)
# # A tibble: 6 × 3
#   speciesID common_name                          species
#   <chr>     <chr>                                <chr>
# 1 s011      banner-tailed kangaroo rat           Dipodomys spectabilis
# 2 s011      dwarf mongoose                       Helogale parvula
# 3 s030      north american red squirrel          Tamiasciurus hudsonicus
# 4 s030      eurasian red squirrel                Sciurus vulgaris
# 5 s045      northern spotted owl                 Strix occidentalis caurina
# 6 s045      grey red-backed vole/grey-sided vole Clethrionomys rufocanus

# common_name values that appear in more than one distinct row
distinct(aa) |> filter(.by = common_name, n() > 1)
# # A tibble: 2 × 3
#   speciesID common_name species
#   <chr>     <chr>       <chr>
# 1 s015      great tit   Parus major
# 2 s073      great tit   Parus major

# species values that appear in more than one distinct row
distinct(aa) |> filter(.by = species, n() > 1)
# # A tibble: 8 × 3
#   speciesID common_name            species
#   <chr>     <chr>                  <chr>
# 1 s015      great tit              Parus major
# 2 s020      gray jay               Perisoreus canadensis
# 3 s073      great tit              Parus major
# 4 s074      pied babbler           Turdoides bicolor
# 5 s109      eurasian kestrel       Falco tinnunculus
# 6 s110      southern pied babblers Turdoides bicolor
# 7 s129      canada jay             Perisoreus canadensis
# 8 s106      common kestrel         Falco tinnunculus
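Since the question also asks for the offending rows in the original data, here is a minimal sketch of one way to get them. It is not part of the original answer. It uses the same idea (a key is suspicious if it appears in more than one distinct row) and then uses semi_join to pull the matching rows, with their row numbers, from the unreduced data. The toy data and the helper names conflicting_keys and flag_rows are hypothetical and serve only as illustration.

```r
library(dplyr)

# Toy data, made up for illustration: "s002" is attached to two different
# animals, and "great tit" appears under two different IDs.
toy <- tibble::tribble(
  ~speciesID, ~common_name,       ~species,
  "s001",     "common lizard",    "Zootoca vivipara",
  "s002",     "social tuco-tuco", "Ctenomys sociabilis",
  "s002",     "dwarf mongoose",   "Helogale parvula",
  "s003",     "great tit",        "Parus major",
  "s004",     "great tit",        "Parus major",
  "s004",     "great tit",        "Parus major"
)

# Hypothetical helper: values of `key` (a column name given as a string)
# that occur in more than one distinct row of `data`.
conflicting_keys <- function(data, key) {
  data |>
    distinct() |>
    count(across(all_of(key)), name = "n_variants") |>
    filter(n_variants > 1)
}

# Hypothetical helper: rows of the original data whose `key` value is
# conflicting, with a `row` column giving the original row position.
flag_rows <- function(data, key) {
  bad <- conflicting_keys(data, key)
  data |>
    mutate(row = row_number()) |>
    semi_join(bad, by = key)
}

conflicting_keys(toy, "speciesID")  # keys to investigate
flag_rows(toy, "speciesID")         # where they sit in the original data
flag_rows(toy, "common_name")
```

Calling flag_rows once per column gives a list of row positions to check against the raw file. Exact duplicate rows are not flagged on their own, because distinct() collapses them before counting.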
