I have a vector tissue containing strings that are delimited by several different characters. The strings in the vector fall roughly into four categories:
- Strings consisting only of term(s) (e.g. Thymus, Thyroid), separated by ,
- Strings containing identifier(s) (e.g. ECO:0000313|RefSeq:XP_014046664.1), each ending with }, followed by term(s) separated by ,
- Strings containing a term followed by identifier(s)
- Strings containing a term, followed by an identifier, then term(s), separated by ,

tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")
For strings in category 1, I can split the terms with a simple strsplit() call:
unlist(strsplit("Head kidney,Thymus,Thyroid,", ","))
[1] "Head kidney" "Thymus" "Thyroid"
unlist(strsplit("Red blood cell,", ","))
[1] "Red blood cell"
For strings in category 2, this is what I came up with, and it works well:
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"), ","))
[1] "Muscle"
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,"), ","))
[1] "Leaf"
unlist(strsplit(sub('.*\\},', "", "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,"), ","))
[1] "Head kidney" "Muscle" "White muscle"
For strings in category 3, this works for me:
sub(',ECO:.*', "", "Blood,ECO:0000313|RefSeq:XP_017326031.1},")
[1] "Blood"
sub(',ECO:.*', "", "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},")
[1] "Spleen"
For category 4, this is what I tried, and it works well:
unlist(strsplit(sub(',ECO:.*},', ",", "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"), ","))
[1] "Brain" "Head kidney" "Muscle" "Spleen" "White muscle"
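To illustrate why these per-category patterns cannot simply be reused across categories, here is a small check. The variable names (cat2, cat4, and so on) are hypothetical and only for illustration, and the closing brace is escaped here as a precaution:

```r
# Category 2 pattern applied to a category 4 string.
cat4 <- "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,"
cat2_on_cat4 <- unlist(strsplit(sub('.*\\},', "", cat4), ","))
# The greedy `.*` consumes everything up to the last "},",
# so the leading term "Brain" is removed along with the identifier.
print(cat2_on_cat4)

# Category 4 pattern applied to a category 2 string.
cat2 <- "ECO:0000313|RefSeq:XP_014046664.1},Muscle,"
cat4_on_cat2 <- unlist(strsplit(sub(',ECO:.*\\},', ",", cat2), ","))
# The string contains no ",ECO:", so nothing is replaced
# and the identifier is kept as if it were a term.
print(cat4_on_cat2)
```

In other words, each pattern depends on where the identifier sits in the string, which is why I would currently have to know the category before choosing a pattern.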
I am looking for a solution, if possible a single regular expression, that handles all of these cases and can be applied directly to the vector.
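One possible direction, offered as a minimal sketch rather than a definitive answer: instead of matching the terms, remove every identifier and then split what is left. This assumes that each identifier is a comma-free token ending in }, and that no term ever contains a } character; both hold for the sample data but are assumptions about the full data set. The names terms_only and term_list are made up for this example.

```r
tissue <- c("Head kidney,Thymus,Thyroid,",
            "Red blood cell,",
            "ECO:0000313|RefSeq:XP_014046664.1},Muscle,",
            "ECO:0000313|RefSeq:XP_016683349.1},ECO:0000313|RefSeq:XP_016683354.1},Leaf,",
            "ECO:0000313|RefSeq:XP_014023833.1},Head kidney,Muscle,White muscle,",
            "Blood,ECO:0000313|RefSeq:XP_017326031.1},",
            "Spleen,ECO:0000313|RefSeq:XP_010844217.1},ECO:0000313|RefSeq:XP_010844218.1},",
            "Brain,ECO:0000313|RefSeq:XP_014030244.1},Head kidney,Muscle,Spleen,White muscle,")

# Delete every run of non-comma characters that is followed by "},".
# Under the stated assumptions this strips each identifier wherever it occurs.
terms_only <- gsub("[^,]*\\},", "", tissue)

# strsplit() is vectorized and returns one character vector per input string.
term_list <- strsplit(terms_only, ",")
print(term_list)
```

Because gsub() and strsplit() are both vectorized, this runs on the whole vector at once and returns a list with one element per input string; unlist(term_list) would flatten it if a single character vector is wanted.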