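I have text data in column Segments and the corresponding word-tag combinations in column Q_c7_collpsd. Q_c7_collpsd is longer than Segments. My task is to trim Q_c7_collpsd, in both length and content, to the exact length and content of Segments. (A further difficulty is that Segments contains special characters that do not occur in Q_c7_collpsd.)

This is what I've tried so far; it only works partially (the wrong assignments are marked with **...*):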

library(data.table)
library(stringr)
library(dplyr)
library(tidyr)

df %>%

  mutate(
    # trim whitespace:
    Segments = trimws(Segments),
    # correct "n 't" to "n't":
    Segments = str_replace_all(Segments, "(?<=n)\\s(?='t)", ""),
    # separate clitics from host:
    Segments = str_replace_all(Segments, "n't", " n't")) %>%
  # split into Segment_type and Utterance:
  separate(Segments, into = c("Segment_type", "Utterance"), sep = ":\\s") %>%
  # remove special characters:
  mutate(Utterance_Clean = str_remove_all(Utterance, "(?![\\s'])\\W")) %>%
  # remove rows where Utterance_Clean is digits:
  filter(!str_detect(Utterance_Clean, "^\\d+$")) %>%

  # create id:
  group_by(Q_c7_collpsd) %>%
  mutate(id = rleid(Sequ, Utterance)) %>%
  group_by(id) %>%
  # split into individual word-tag combinations:
  separate_rows(Q_c7_collpsd, sep = "\\s") %>%
  separate(Q_c7_collpsd, into = c("w", "c7"), sep = "_") %>%
  
  # split into individual words:
  separate_rows(Utterance_Clean, sep = "\\s") %>%
  group_by(Sequ, id) %>%
  # keep only rows where w == Utterance_Clean:
  filter(w == Utterance_Clean) %>%
  # combine w and c7 tag back again:
  mutate(w_c7 = str_c(w, c7, sep = "_")) %>%
  # put the split elements back together:
  summarise(across(c(Segment_type, Utterance), first),
            w_c7 = str_c(w_c7, collapse = " "))

Result:

# A tibble: 6 × 5
# Groups:   Sequ [2]
   Sequ    id Segment_type Utterance                     w_c7                                                        
  <dbl> <int> <chr>        <chr>                         <chr>                                                       
1     1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1 **my_APPGE*                             
2     1     2 tcu_pol      you have n't seen my place?°= **my_APPGE* you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1
3     2     1 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 leggings_NN2                        
4     2     2 frg          or                            or_CC                                                       
5     2     3 tcu_decl     no?                           no_UH                                                       
6     2     4 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_VVI  

Desired result:

   Sequ    id Segment_type Utterance                     w_c7                                                        
  <dbl> <int> <chr>        <chr>                         <chr>                                                       
1     1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1                            
2     1     2 tcu_pol      you have n't seen my place?°= you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1
3     2     1 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 leggings_NN2                        
4     2     2 frg          or                            or_CC                                                       
5     2     3 tcu_decl     no?                           no_UH                                                       
6     2     4 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_VVI 

Data:

df <- data.frame(
  Sequ = c(1,1,2,2,2,2,2), 
  Segments = c("tcu_noQ: °[o::h my God] ", "tcu_pol: you haven't seen my place?°=", "tcu_decl: [but you] use leggings¿","frg: or","pause: 0.485","tcu_decl: no?","tcu_noQ: [(I don 't know)]"),
  Q_c7_collpsd = c("oh_UH my_APPGE God_NP1 you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1", "oh_UH my_APPGE God_NP1 you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1", 
                   "but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI"))

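Recommended answer

I think this will do.

Explanation: everything is the same up to the step marked with the row of #####. If you inspect the output at that point, the cleaned utterances of each Sequ are split into one word per row. So I propose creating an id there for each cleaned word within its Sequ. The remaining job is to find the matching piece of Q_c7_collpsd. Instead of using separate_rows() and separate() again, I used purrr::map2_chr() to extract w_c7.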

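Further explanation of the map2_chr() part:

  • The first argument is Q_c7_collpsd split on whitespace (the same split you did with separate_rows()).
  • The second argument to map over is id.
  • The function part simply extracts the id-th element of the first argument, which does the job of your filter step.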
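
To see the three points above in isolation, here is a minimal sketch on toy input. The vectors `tagged` and `id` are made up for illustration and only mimic the shape of the question's data; they are not taken from the pipeline itself.

```r
library(stringr)
library(purrr)

# Toy input (hypothetical): every row carries the full tagged string,
# and `id` says which word position that row stands for.
tagged <- rep("oh_UH my_APPGE God_NP1", 3)
id     <- 1:3

# str_split() returns a list with one character vector of tokens per row.
# map2_chr() walks that list and `id` in parallel and picks the id-th token.
picked <- map2_chr(str_split(tagged, "\\s"), id, ~ .x[[.y]])
picked
#> [1] "oh_UH"    "my_APPGE" "God_NP1"
```

Note that `.x[[.y]]` raises an error when `id` is larger than the number of tokens, so the approach relies on the words and tags lining up one to one.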
library(tidyverse)

df %>% 
  mutate(
    # trim whitespace:
    Segments = trimws(Segments),
    # correct "n 't" to "n't":
    Segments = str_replace_all(Segments, "(?<=n)\\s(?='t)", ""),
    # separate clitics from host:
    Segments = str_replace_all(Segments, "n't", " n't")
  ) %>%
  # split into Segment_type and Utterance:
  separate(Segments,
           into = c("Segment_type", "Utterance"),
           sep = ":\\s") %>%
  # remove special characters:
  mutate(Utterance_Clean = str_remove_all(Utterance, "(?![\\s'])\\W")) %>%
  # remove rows where Utterance_Clean is digits:
  filter(!str_detect(Utterance_Clean, "^\\d+$")) %>%
  separate_rows(Utterance_Clean, sep = "\\s") %>% 
  #####################################################
  # Create ID here
  mutate(id = row_number(), .by = Sequ) %>% 
  mutate(Q_new = map2_chr(str_split(Q_c7_collpsd, "\\s"), id, ~ .x[[.y]])) %>%
  summarise(
    w_c7 = str_c(Q_new, collapse = " "),
    .by = c(Sequ, Segment_type, Utterance)
  )
#> # A tibble: 6 × 4
#>    Sequ Segment_type Utterance                     w_c7                         
#>   <dbl> <chr>        <chr>                         <chr>                        
#> 1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1       
#> 2     1 tcu_pol      you have n't seen my place?°= you_PPY have_VH0 n't_XX seen…
#> 3     2 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 legg…
#> 4     2 frg          or                            or_CC                        
#> 5     2 tcu_decl     no?                           no_UH                        
#> 6     2 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_V…

Created on 2024-03-06 with reprex v2.0.2
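
Because the extraction is purely positional, it only gives the right tags when each Sequ has exactly as many cleaned words as it has word-tag pairs. The following is a rough sketch of a sanity check for that assumption; `counts_match()` is a hypothetical helper written for this illustration, not part of the answer or of any package.

```r
library(stringr)

# Hypothetical helper: TRUE when the cleaned utterances of one Sequ contain
# exactly as many words as there are word_tag pairs in the tagged string.
counts_match <- function(utterances_clean, tagged) {
  n_words <- sum(str_count(utterances_clean, "\\S+"))  # words across all utterances
  n_tags  <- str_count(tagged, "\\S+")                 # word_tag pairs
  n_words == n_tags
}

# Sequ 1 from the question, after cleaning:
ok <- counts_match(
  c("oh my God", "you have n't seen my place"),
  "oh_UH my_APPGE God_NP1 you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1"
)
ok
#> [1] TRUE
```

If the check fails for some Sequ, the positional lookup would shift tags onto the wrong words or stop with a subscript error, so that group would need a different alignment strategy.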


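PS: since version 1.1.0, dplyr has a function consecutive_id() that works just like data.table::rleid(), so there is no need to load data.table for a single function. In any case, that function is not used in the revised answer.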
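
As a small illustration of the PS, this sketch shows what consecutive_id() returns for a toy vector (the vector `x` is made up for the example). It assumes dplyr 1.1.0 or later is installed.

```r
library(dplyr)  # consecutive_id() needs dplyr >= 1.1.0

# The id increases by one every time the value changes from one element
# to the next, so the second run of "a" gets a new id.
x   <- c("a", "a", "b", "b", "a")
ids <- consecutive_id(x)
ids
#> [1] 1 1 2 2 3
```

In the question's pipeline, `rleid(Sequ, Utterance)` could be swapped for `consecutive_id(Sequ, Utterance)`, which would remove the need for `library(data.table)`.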
