I have a function that finds matches in a data frame column (ignore the t2 line, it is "turned off"):

library(stringr)

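find.all.matches <- function(search.col, pattern) {
  # One matrix of matches per element of search.col
  captured <- str_match_all(search.col, pattern = pattern)
  trimmed <- lapply(captured, str_trim)
  # t2 <- lapply(trimmed, function(x) gsub("[^a-z]", "", x))  ## turned off
  # lapply rather than sapply: sapply may simplify the result to a matrix
  # when every row yields the same number of unique matches
  uniq <- lapply(trimmed, unique)
  # Collapse each row's unique matches into one comma-separated string
  found.col <- unlist(lapply(uniq, toString))
  return(found.col)
}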

I run this code on one column of a large dataset with about 20,000 rows. The column holds abstracts from scientific journals.

I use the following code to add the words extracted by the pattern to the data frame as a new column:

testing2 <- find.all.matches(search.col = all_data$abstract_l,
                             pattern = pattern)

all_data$testing_mu_m <- testing2

This is the current pattern:

pattern = '\\d+(?:[.,]\\d+)*\\s*mu m\\b|ba\\b'

In the example abstract below, it picks out every number that precedes mu m, as well as ba:

a protocol for in vitro propagation of adult lavandula dentata plants has been achieved. cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa. highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk. multiplication efficiency through subcultures was significantly affected by the cytokinin concentration in the initial culture medium. subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin. shoots were easily rooted on murashige and skoog hormone-free medium with macronutrients at half-strength. plants were successfully transplanted into soil. 
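For illustration, here is a small sketch of what the current pattern extracts from a short fragment of this abstract. It calls str_extract_all directly rather than find.all.matches, and the variable names fragment and matches are only placeholders for this example.

```r
library(stringr)

# Current pattern from the question
pattern <- '\\d+(?:[.,]\\d+)*\\s*mu m\\b|ba\\b'

# Short fragment taken from the abstract above
fragment <- "explants grown in the presence of 5.0 mu m ba or 20 mu m kin"

# All matches in the fragment, in order of appearance
matches <- str_extract_all(fragment, pattern)[[1]]
print(matches)
```

This should give the three pieces "5.0 mu m", "ba" and "20 mu m", but not the sentence around them.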

I was wondering whether there is a way to extract every complete sentence that contains ba. I would like a pattern that I can plug into the find.all.matches function.

Desired output:

cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa.
highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk.
subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin.

Recommended answer

You can use this regular expression to match each whole sentence that contains the word ba:

(?<=^|\. )(?:(?!\.(?: |$)).)*?\bba\b.*?\.(?= |$)

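It matches the following:

  • (?<=^|\. ): the start of a sentence (the current position is at the start of the string or is preceded by a . and a space)
  • (?:(?!\.(?: |$)).)*?: a minimal number of characters, none of which is a . followed by a space or the end of the string (a tempered greedy token)
  • \bba\b: the word ba
  • .*?\.(?= |$): a minimal number of characters, followed by a . and then a space or the end of the string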
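To see why the tempered greedy token matters, the sketch below compares the pattern with a simplified variant that uses a plain .*? in its place. The strings are written as R literals, so every backslash is doubled; the two-sentence text and the names naive and tempered are made up for this example.

```r
library(stringr)

# Simplified variant: plain lazy .*? instead of the tempered token
naive <- "(?<=^|\\. ).*?\\bba\\b.*?\\.(?= |$)"

# Pattern from the answer, with the tempered greedy token
tempered <- "(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)"

# Made-up text: only the second sentence contains the word ba
text <- "no cytokinin here. medium with ba was used."

naive.match <- str_extract(text, naive)
tempered.match <- str_extract(text, tempered)

print(naive.match)     # starts at the first sentence and runs across the boundary
print(tempered.match)  # only the sentence that contains ba
```

With the plain .*?, the match can begin at a sentence that has no ba and keep going until it finds one in a later sentence. The tempered token stops the scan at the first sentence-ending period, so the match has to start in the sentence that actually contains ba.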

Regex demo on regex101

Note that to use it in R, you need to double every backslash, i.e.

(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)
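As a rough usage sketch, the doubled-backslash pattern can be stored as an R string and tried with stringr on a single abstract. The abstract here is a shortened, made-up stand-in rather than the full text from the question, and the names sentence.pattern and sentences are placeholders.

```r
library(stringr)

# Doubled-backslash form of the pattern, as an R string literal
sentence.pattern <- "(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)"

# Shortened stand-in for one abstract (assumption: not the real data)
abstract <- paste(
  "a protocol has been achieved.",
  "cultures were grown on medium containing ba, kin, and naa.",
  "efficiency was affected by the cytokinin concentration.",
  "shoots grew in the presence of 2.0 mu m ba or 40.0 mu m kin."
)

# Every sentence that contains the word ba
sentences <- str_extract_all(abstract, sentence.pattern)[[1]]
print(sentences)
```

Only the second and fourth sentences should come back. The decimal points in 2.0 and 40.0 do not end the match, because they are not followed by a space or the end of the string. The same string can be passed as the pattern argument of find.all.matches.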
