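I have a vector tissue containing strings whose fields are separated by several different delimiters. The strings in the vector fall roughly into four categories:

  1. Strings made up only of term(s) (e.g. Thymus, Thyroid), separated by commas

  2. Strings that start with identifier(s) (e.g. ECO:0000313|RefSeq:XP_014046664.1), each ending in }, and followed by term(s) separated by commas

  3. Strings containing a term followed by identifier(s)

  4. Strings containing a term, followed by an identifier, followed by term(s), separated by commas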

    tissue <- c("Head kidney,Thymus,Thyroid,", 
                "Red blood cell,", 
                "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
                "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,", 
                "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
                "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
                "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
                "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")
    

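As a quick sanity check on the four categories, here is a minimal sketch that labels each element of the vector. The helper name classify_tissue is hypothetical, and the rules assume that every identifier starts with "ECO:" and ends in "},", as in the sample data.

```r
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# Hypothetical helper: return the category (1-4) of each string.
classify_tissue <- function(x) {
  has_id    <- grepl("ECO:", x, fixed = TRUE)  # contains at least one identifier
  starts_id <- grepl("^ECO:", x)               # an identifier comes first
  ends_id   <- grepl("\\},$", x)               # the last field is an identifier
  ifelse(!has_id, 1L,
         ifelse(starts_id, 2L,
                ifelse(ends_id, 3L, 4L)))
}

classify_tissue(tissue)
# [1] 1 1 2 2 2 3 3 4
```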
For strings in category 1, a simple strsplit() call splits the terms:

    unlist(strsplit("Head kidney,Thymus,Thyroid,", ","))
    [1] "Head kidney" "Thymus"      "Thyroid"

    unlist(strsplit("Red blood cell,", ","))
    [1] "Red blood cell"

For strings in category 2, this is what I came up with, and it works well:

    unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"), ","))
    [1] "Muscle"

    unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,"), ","))
    [1] "Leaf"

    unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,"), ","))
    [1] "Head kidney"  "Muscle"       "White muscle"

For strings in category 3, this works for me:

    sub(',ECO:.*', "", "Blood,ECO:0000313|RefSeq:XP_017326031.1},")
    [1] "Blood"

    sub(',ECO:.*', "", "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},")
    [1] "Spleen"

For category 4, this is what I tried, and it works:

    unlist(strsplit(sub(',ECO:.*},', ",", "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"), ","))
    [1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"

I am looking for a solution, ideally a single regular expression, that handles all of these cases and can be applied directly to the vector.

Recommended answer

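We can remove the identifier substrings and then use strsplit, dropping the empty strings that the removal leaves behind:

    library(stringr)
    # Delete every identifier (from "ECO:" through the closing "}"), split on commas,
    # then drop the empty strings left where identifiers used to be
    lapply(strsplit(str_remove_all(tissue, "ECO:[^\\}]+\\}"), ","), 
         function(x) x[nzchar(x)])

Output:

    [[1]]
    [1] "Head kidney" "Thymus"      "Thyroid"    

    [[2]]
    [1] "Red blood cell"

    [[3]]
    [1] "Muscle"

    [[4]]
    [1] "Leaf"

    [[5]]
    [1] "Head kidney"  "Muscle"       "White muscle"

    [[6]]
    [1] "Blood"

    [[7]]
    [1] "Spleen"

    [[8]]
    [1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"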
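
Since the question asks for a single regular expression, another option is to extract the terms directly instead of deleting the identifiers. The sketch below uses base R only. It assumes that every term is followed by a comma (true for the sample data, where each string ends in a comma) and that terms never contain "}"; a final term with no trailing comma would be missed.

```r
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# A term is a run of characters containing neither "," nor "}" that is
# immediately followed by ",". Identifiers are skipped because they are
# followed by "}" rather than ",". The lookahead (?=,) requires perl = TRUE.
terms <- regmatches(tissue, gregexpr("[^,}]+(?=,)", tissue, perl = TRUE))

terms[[8]]
# [1] "Brain"        "Head kidney"  "Muscle"       "Spleen"       "White muscle"
```

The result is a list with one character vector per input string, the same shape as the lapply() output above, with no empty strings to filter out afterwards.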

Or with a tidyverse workflow:

    library(stringr)
    library(dplyr)
    library(tidyr)
    str_remove_all(tissue, "ECO:[^\\}]+\\}") %>% 
      trimws(whitespace = ",+") %>%
      str_replace_all(',{2,}', ",") %>% 
      tibble(col1 = .) %>% 
      tidyr::separate(col1, into = str_c('V', 
        seq(max(str_count(.$col1, ",")) + 1)), sep = ",", fill = "right")

Output:

    # A tibble: 8 × 5
      V1             V2          V3           V4     V5          
      <chr>          <chr>       <chr>        <chr>  <chr>       
    1 Head kidney    Thymus      Thyroid      <NA>   <NA>        
    2 Red blood cell <NA>        <NA>         <NA>   <NA>        
    3 Muscle         <NA>        <NA>         <NA>   <NA>        
    4 Leaf           <NA>        <NA>         <NA>   <NA>        
    5 Head kidney    Muscle      White muscle <NA>   <NA>        
    6 Blood          <NA>        <NA>         <NA>   <NA>        
    7 Spleen         <NA>        <NA>         <NA>   <NA>        
    8 Brain          Head kidney Muscle       Spleen White muscle
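
If a long layout (one row per term) is more convenient than a wide table padded with NA, a base R sketch could look like the following. The column names id and term are only illustrative; id is the position of the source string in tissue.

```r
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# Strip the identifiers, split on commas, and drop the empty pieces.
terms <- lapply(strsplit(gsub("ECO:[^\\}]+\\}", "", tissue), ","),
                function(x) x[nzchar(x)])

# One row per (string, term) pair.
long <- data.frame(id   = rep(seq_along(terms), lengths(terms)),
                   term = unlist(terms),
                   stringsAsFactors = FALSE)

nrow(long)
# [1] 16
```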

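Or using base R only. Because read.csv() infers the number of columns from the first five lines of input, col.names is set explicitly so that the last string, which has five terms, is not wrapped onto an extra row:

    cleaned <- gsub(",{2,}", ",", trimws(gsub("ECO:[^\\}]+\\}", "", tissue), whitespace = ",+"))
    # Size col.names to the longest row so every string stays on a single row
    read.csv(text = cleaned, header = FALSE, fill = TRUE, sep = ",",
        col.names = paste0("V", seq_len(max(lengths(strsplit(cleaned, ","))))))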
