R: Trim one string vector to exactly the size of another string vector

Posted on March 6

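I have text data in column Segments and the corresponding word-tag combinations in column Q_c7_collpsd. Q_c7_collpsd is longer than Segments. My task is to trim Q_c7_collpsd down to the exact length and content of Segments. (An additional difficulty is that Segments contains special characters that do not occur in Q_c7_collpsd.)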

This is what I've tried so far; it works only partially (incorrect assignments are marked with **...*):

library(data.table)
library(stringr)
library(dplyr)
library(tidyr)

df %>%

  mutate(
    # trim whitespace:
    Segments = trimws(Segments),
    # correct "n 't" to "n't":
    Segments = str_replace_all(Segments, "(?<=n)\\s(?='t)", ""),
    # separate clitics from host:
    Segments = str_replace_all(Segments, "n't", " n't")) %>%
  # split into Segment_type and Utterance:
  separate(Segments, into = c("Segment_type", "Utterance"), sep = ":\\s") %>%
  # remove special characters:
  mutate(Utterance_Clean = str_remove_all(Utterance, "(?![\\s'])\\W")) %>%
  # remove rows where Utterance_Clean is digits:
  filter(!str_detect(Utterance_Clean, "^\\d+$")) %>%

  # create id:
  group_by(Q_c7_collpsd) %>%
  mutate(id = rleid(Sequ, Utterance)) %>%
  group_by(id) %>%
  # split into individual word-tag combinations:
  separate_rows(Q_c7_collpsd, sep = "\\s") %>%
  separate(Q_c7_collpsd, into = c("w", "c7"), sep = "_") %>%
  
  # split into individual words:
  separate_rows(Utterance_Clean, sep = "\\s") %>%
  group_by(Sequ, id) %>%
  # keep only rows where w == Utterance_Clean:
  filter(w == Utterance_Clean) %>%
  # combine w and c7 tag back again:
  mutate(w_c7 = str_c(w, c7, sep = "_")) %>%
  # put the split elements back together:
  summarise(across(c(Segment_type, Utterance), first),
            w_c7 = str_c(w_c7, collapse = " "))

Result:

# A tibble: 6 × 5
# Groups:   Sequ [2]
   Sequ    id Segment_type Utterance                     w_c7                                                        
  <dbl> <int> <chr>        <chr>                         <chr>                                                       
1     1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1 **my_APPGE*                             
2     1     2 tcu_pol      you have n't seen my place?°= **my_APPGE* you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1
3     2     1 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 leggings_NN2                        
4     2     2 frg          or                            or_CC                                                       
5     2     3 tcu_decl     no?                           no_UH                                                       
6     2     4 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_VVI
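
The stray my_APPGE entries are consistent with the tokens being matched by word value rather than by position: after both separate_rows() calls every tagged token is paired with every word of the segment, and filter(w == Utterance_Clean) keeps any pair whose words are equal, so a word that occurs twice in Q_c7_collpsd ("my") is picked up by both segments. The following is a minimal base R sketch of that effect for Sequ 1 only. It uses %in% as a simplified stand-in for the pipeline, and the names tagged, w, seg1, seg2, leak1 and leak2 are illustrative, not part of the original code.

```r
# Tagged tokens for Sequ 1, copied from the sample data.
tagged <- c("oh_UH", "my_APPGE", "God_NP1", "you_PPY", "have_VH0",
            "n't_XX", "seen_VVN", "my_APPGE", "place_NN1")

# Word part of each token (everything before the first underscore).
w <- sub("_.*$", "", tagged)

# Cleaned words of the first two segments (special characters removed).
seg1 <- c("oh", "my", "God")
seg2 <- c("you", "have", "n't", "seen", "my", "place")

# Matching by value keeps every token whose word occurs in the segment,
# regardless of where that token sits in the sequence.
leak1 <- tagged[w %in% seg1]
leak2 <- tagged[w %in% seg2]

print(leak1)  # "oh_UH" "my_APPGE" "God_NP1" "my_APPGE"
print(leak2)  # "my_APPGE" "you_PPY" "have_VH0" "n't_XX" "seen_VVN" "my_APPGE" "place_NN1"
```

Both vectors reproduce the misassigned rows 1 and 2 above, which suggests that the fix needs to take token position into account.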

Desired result:

   Sequ    id Segment_type Utterance                     w_c7                                                        
  <dbl> <int> <chr>        <chr>                         <chr>                                                       
1     1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1                            
2     1     2 tcu_pol      you have n't seen my place?°= you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1
3     2     1 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 leggings_NN2                        
4     2     2 frg          or                            or_CC                                                       
5     2     3 tcu_decl     no?                           no_UH                                                       
6     2     4 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_VVI

Data:

df <- data.frame(
  Sequ = c(1,1,2,2,2,2,2), 
  Segments = c("tcu_noQ: °[o::h my God] ", "tcu_pol: you haven't seen my place?°=", "tcu_decl: [but you] use leggings¿","frg: or","pause: 0.485","tcu_decl: no?","tcu_noQ: [(I don 't know)]"),
  Q_c7_collpsd = c("oh_UH my_APPGE God_NP1 you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1", "oh_UH my_APPGE God_NP1 you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1", 
                   "but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI","but_CCB you_PPY use_VV0 leggings_NN2 or_CC no_UH I_PPIS1 do_VD0 n't_XX know_VVI"))

Recommended answer:

library(tidyverse)

df %>%
  mutate(
    # trim whitespace:
    Segments = trimws(Segments),
    # correct "n 't" to "n't":
    Segments = str_replace_all(Segments, "(?<=n)\\s(?='t)", ""),
    # separate clitics from host:
    Segments = str_replace_all(Segments, "n't", " n't")
  ) %>%
  # split into Segment_type and Utterance:
  separate(Segments, into = c("Segment_type", "Utterance"), sep = ":\\s") %>%
  # remove special characters:
  mutate(Utterance_Clean = str_remove_all(Utterance, "(?![\\s'])\\W")) %>%
  # remove rows where Utterance_Clean is digits:
  filter(!str_detect(Utterance_Clean, "^\\d+$")) %>%
  # one row per word of the cleaned utterance:
  separate_rows(Utterance_Clean, sep = "\\s") %>%
  #####################################################
  # Create ID here
  mutate(id = row_number(), .by = Sequ) %>%
  # pick the id-th tagged token of Q_c7_collpsd for each word:
  mutate(Q_new = map2_chr(str_split(Q_c7_collpsd, "\\s"), id, ~ .x[[.y]])) %>% # View
  summarise(
    w_c7 = str_c(Q_new, collapse = " "),
    .by = c(Sequ, Segment_type, Utterance)
  )
#> # A tibble: 6 × 4
#>    Sequ Segment_type Utterance                     w_c7
#>   <dbl> <chr>        <chr>                         <chr>
#> 1     1 tcu_noQ      °[o::h my God]                oh_UH my_APPGE God_NP1
#> 2     1 tcu_pol      you have n't seen my place?°= you_PPY have_VH0 n't_XX seen…
#> 3     2 tcu_decl     [but you] use leggings¿       but_CCB you_PPY use_VV0 legg…
#> 4     2 frg          or                            or_CC
#> 5     2 tcu_decl     no?                           no_UH
#> 6     2 tcu_noQ      [(I do n't know)]             I_PPIS1 do_VD0 n't_XX know_V…
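
The answer code above appears to rely on position rather than on word identity: within each Sequ, the n-th word of the cleaned utterances is paired with the n-th tagged token. The same idea can be sketched in base R by slicing the token vector with cumulative segment lengths. This is only an illustration under the assumption that the cleaned segments contain exactly as many words as there are tagged tokens, in the same order. The names tagged, seg_words, n, end, start and w_c7 are hypothetical and chosen for this sketch.

```r
# Tagged tokens for Sequ 1, copied from the sample data.
tagged <- c("oh_UH", "my_APPGE", "God_NP1", "you_PPY", "have_VH0",
            "n't_XX", "seen_VVN", "my_APPGE", "place_NN1")

# Cleaned words of each segment in Sequ 1, in their original order.
seg_words <- list(
  c("oh", "my", "God"),
  c("you", "have", "n't", "seen", "my", "place")
)

# Number of words per segment and the token range each segment covers.
n     <- lengths(seg_words)  # 3 6
end   <- cumsum(n)           # 3 9
start <- end - n + 1         # 1 4

# The positional approach only works if the counts line up (assumption).
stopifnot(sum(n) == length(tagged))

# Slice the tokens by position and collapse each slice into one string.
w_c7 <- mapply(function(s, e) paste(tagged[s:e], collapse = " "), start, end)

print(w_c7)
# "oh_UH my_APPGE God_NP1"
# "you_PPY have_VH0 n't_XX seen_VVN my_APPGE place_NN1"
```

Because the slices are taken by position, the repeated word "my" no longer leaks from one segment into the other. If the word and token counts can differ in real data, a check like the stopifnot() call is one way to surface the mismatch before any tokens are assigned.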
