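I have a dataset containing the susceptibility of different bacteria to various drugs. I want to get the frequency of susceptible isolates by organism. Is there a way to streamline this, rather than copying and pasting for each drug? I was thinking of using apply, or maybe writing a function, but I'm not sure where to start.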

pacman::p_load(tidyverse,
               janitor)

demo_dat <- data.frame(
  stringsAsFactors = FALSE,
                 organism_name = c("Klebsiella pneumonia","Klebsiella pneumonia",
                                   "Escherichia coli","Klebsiella pneumonia",
                                   "Enterobacter cloacae","Escherichia coli",
                                   "Klebsiella pneumonia","Escherichia coli",
                                   "Escherichia coli","Escherichia coli",
                                   "Klebsiella pneumonia","Klebsiella pneumonia",
                                   "Escherichia coli","Klebsiella pneumonia",
                                   "Escherichia coli","Serratia marcenscens",
                                   "Klebsiella oxytoca","Escherichia coli",
                                   "Proteus mirabilis","Escherichia coli"),
                  amox_clav_po = c("S",
                                   "S","S","I","R","I","S","I","R","I",
                                   "S","S","S","S","I","R","S","S","S",
                                   "R"),
                    amp_sul_iv = c("S",
                                   "I","S","S","R","R","S","S","R","I",
                                   "S","I","S","I","R","R","S","S","S",
                                   "R"),
                   cefaclor_po = c("S",
                                   "S","S","S","R","S","S","S","S","S",
                                   "S","S","S","S","R","R","S","S","S",
                                   "S"),
                ceftriaxone_iv = c("S",
                                   "S","S","S","S","S","S","S","S","S",
                                   "S","S","S","S","R","S","S","S","S",
                                   "S")
            )

demo_dat |> 
  group_by(organism_name) |> 
  summarise(susceptibility = sum((amox_clav_po == "S")/n()))
#> # A tibble: 6 × 2
#>   organism_name        susceptibility
#>   <chr>                         <dbl>
#> 1 Enterobacter cloacae          0    
#> 2 Escherichia coli              0.333
#> 3 Klebsiella oxytoca            1    
#> 4 Klebsiella pneumonia          0.857
#> 5 Proteus mirabilis             1    
#> 6 Serratia marcenscens          0

demo_dat |> 
  group_by(organism_name) |> 
  summarise(susceptibility = sum((amp_sul_iv == "S")/n()))
#> # A tibble: 6 × 2
#>   organism_name        susceptibility
#>   <chr>                         <dbl>
#> 1 Enterobacter cloacae          0    
#> 2 Escherichia coli              0.444
#> 3 Klebsiella oxytoca            1    
#> 4 Klebsiella pneumonia          0.571
#> 5 Proteus mirabilis             1    
#> 6 Serratia marcenscens          0

Created on 2024-01-29 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Phoenix
#>  date     2024-01-29
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.1   2022-08-29 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.2.1   2022-06-27 [1] CRAN (R 4.2.0)
#>  digest          0.6.30  2022-10-18 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 0.5.2   2022-08-19 [1] CRAN (R 4.2.0)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] CRAN (R 4.2.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.0)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.0)
#>  janitor       * 2.1.0   2021-01-05 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.3   2022-10-21 [1] CRAN (R 4.2.0)
#>  knitr           1.40    2022-08-24 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate       1.9.0   2022-11-06 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr          0.1.9   2022-08-19 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pacman          0.5.1   2019-03-11 [1] CRAN (R 4.2.0)
#>  pillar          1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.0)
#>  scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  snakecase       0.11.0  2019-05-25 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.4.1   2022-08-20 [1] CRAN (R 4.2.0)
#>  tibble        * 3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.2.1   2022-09-08 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.1.1   2022-11-04 [1] CRAN (R 4.2.0)
#>  tzdb            0.4.0   2023-05-12 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.34    2022-10-18 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.6   2022-10-18 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Recommended answer

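r2evans and Onyambu have given two perfectly reasonable answers to your question, but here is another variant that I think is worth mentioning. I think it's worthwhile because (1) it demonstrates the principles of tidy data; (2) it's a general solution; (3) it's potentially more robust; and (4) in my opinion at least, it's both more compact and easier to understand, and hence easier to maintain.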

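Your dataset is not tidy because information you need for your analysis (the drug name and the route of administration) is contained in the column names rather than in values in the data frame.

So the first thing to do is to make the data frame tidy: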

longData <- demo_dat %>% 
  pivot_longer(
    -organism_name,
    names_to = c("Drug", "Route"),
    values_to = "Susceptibility",
    names_pattern = "(.+)_(.+)$"
  )

longData 
# A tibble: 80 × 4
   organism_name        Drug        Route Susceptibility
   <chr>                <chr>       <chr> <chr>         
 1 Klebsiella pneumonia amox_clav   po    S             
 2 Klebsiella pneumonia amp_sul     iv    S             
 3 Klebsiella pneumonia cefaclor    po    S             
 4 Klebsiella pneumonia ceftriaxone iv    S             
 5 Klebsiella pneumonia amox_clav   po    S             
 6 Klebsiella pneumonia amp_sul     iv    I             
 7 Klebsiella pneumonia cefaclor    po    S             
 8 Klebsiella pneumonia ceftriaxone iv    S             
 9 Escherichia coli     amox_clav   po    S             
10 Escherichia coli     amp_sul     iv    S             
# ℹ 70 more rows

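The only slightly tricky thing here is extracting the drug name and the route of administration from the column names. names_pattern = "(.+)_(.+)$" says that I want to extract two pieces of information (denoted by the groups in parentheses). The second group ends at the end of the string ($) and consists of everything between the last underscore and the end of the string. The first group consists of everything between the start of the string and the last underscore. [The underscore in "amox_clav" is the reason we need this slightly clumsy approach: simply splitting on every underscore would break the drug name apart.]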
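To see what the pattern captures without running the whole pivot, here is a minimal sketch in base R. It applies the same regular expression to the four drug column names from demo_dat using sub(); the variable names cols, pattern, drug and route are chosen only for this illustration.

```r
# Column names copied from demo_dat (everything except organism_name)
cols <- c("amox_clav_po", "amp_sul_iv", "cefaclor_po", "ceftriaxone_iv")
pattern <- "(.+)_(.+)$"

# "\\1" and "\\2" refer back to the first and second parenthesised groups.
# Because the first group is greedy, the split happens at the LAST underscore.
drug  <- sub(pattern, "\\1", cols)
route <- sub(pattern, "\\2", cols)

data.frame(column = cols, drug = drug, route = route)
```

The drug column keeps its internal underscore ("amox_clav"), which is the behaviour pivot_longer() relies on above.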

Now it's easy to obtain the information you want:

longData %>% 
  group_by(organism_name, Drug, Route) %>% 
  summarise(Percentage = (sum(Susceptibility == "S") / n()), .groups = "drop")
# A tibble: 24 × 4
   organism_name        Drug        Route Percentage
   <chr>                <chr>       <chr>      <dbl>
 1 Enterobacter cloacae amox_clav   po         0    
 2 Enterobacter cloacae amp_sul     iv         0    
 3 Enterobacter cloacae cefaclor    po         0    
 4 Enterobacter cloacae ceftriaxone iv         1    
 5 Escherichia coli     amox_clav   po         0.333
 6 Escherichia coli     amp_sul     iv         0.444
 7 Escherichia coli     cefaclor    po         0.889
 8 Escherichia coli     ceftriaxone iv         0.889
 9 Klebsiella oxytoca   amox_clav   po         1    
10 Klebsiella oxytoca   amp_sul     iv         1    
# ℹ 14 more rows
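If you would rather end up with one row per organism and one column per drug, as in the question's own output, the long summary can be spread back out. The following is a rough sketch under a few assumptions: toy is a small made-up data frame standing in for demo_dat, and mean() of a logical vector is used as a shorthand for sum(...) / n(), since it gives the same proportion.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for demo_dat (values invented for illustration)
toy <- tibble(
  organism_name  = c("Escherichia coli", "Escherichia coli", "Klebsiella oxytoca"),
  amox_clav_po   = c("S", "I", "S"),
  ceftriaxone_iv = c("S", "S", "R")
)

wide_summary <- toy %>%
  pivot_longer(
    -organism_name,
    names_to = c("Drug", "Route"),
    values_to = "Susceptibility",
    names_pattern = "(.+)_(.+)$"
  ) %>%
  group_by(organism_name, Drug, Route) %>%
  # mean() of TRUE/FALSE is the proportion of TRUE values
  summarise(Percentage = mean(Susceptibility == "S"), .groups = "drop") %>%
  # One column per drug/route combination, e.g. amox_clav_po
  pivot_wider(names_from = c(Drug, Route), values_from = Percentage)

wide_summary
```

Keeping Route in names_from avoids collisions if the same drug appears with more than one route.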

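The reason I claim this solution is more robust is that it works for any combination of drug name and route, whatever their values, as long as the column names follow the drug_route pattern. It's also easy to generalize to other classifications: for example, by dose or formulation.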
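As an illustration of that generalization, here is a sketch that assumes the column names carry a third component for the dose. The data frame toy, its column names (such as amox_clav_po_500mg) and the doses themselves are hypothetical and not part of the original dataset; only the extra group in names_pattern and the extra entry in names_to differ from the code above.

```r
library(tidyr)

# Hypothetical data: the dose suffixes are invented for illustration
toy <- data.frame(
  organism_name      = c("Escherichia coli", "Klebsiella oxytoca"),
  amox_clav_po_500mg = c("S", "R"),
  ceftriaxone_iv_1g  = c("S", "S")
)

long_toy <- pivot_longer(
  toy,
  -organism_name,
  names_to = c("Drug", "Route", "Dose"),
  values_to = "Susceptibility",
  # Three groups: the last two underscores separate route and dose
  names_pattern = "(.+)_(.+)_(.+)$"
)

long_toy
```

Adding Dose to the group_by() call would then give susceptibility by organism, drug, route and dose.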
