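I am trying to use the new separate_wider_regex() function to split strings.

In the first example (moldavia_1), every row follows the same pattern, so extracting all the columns is straightforward: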

library(dplyr)
library(tidyr)

moldavia_1 <- 
  tibble(adresa = c("1;MD-3101,Balti str-la Botu Pavel 3",
                    "3;MD-3102,Balti str-la Muresanu A. 11",
                    "17;MD-3102,Balti str-la Sorocii 46",
                    "398;MD-3111,Balti str-la Stefan cel Mare 20",
                    "1130;MD-3128,Balti str-la Lvovului 2",
                    "1252;MD-3128,Balti str-la Lvovului 1",
                    "2814;MD-3102,Balti str-la Cahulului 44"))

Using separate_wider_regex():

moldavia_1 %>% 
  separate_wider_regex(cols = adresa,
                       patterns = c(ids = "^\\d+", 
                                    ";",
                                    cod_post = ".*",
                                    ",",
                                    cod_4 = "\\w+",
                                    "\\s",
                                    str = "str-la",
                                    den_str = "\\s[A-Z][a-z]+.*(?<=[a-z]|[A-Z][:punct:])\\s(?=[0-9])",
                                    nr = "\\d+$"))

# A tibble: 7 × 6
  ids   cod_post cod_4 str    den_str             nr   
  <chr> <chr>    <chr> <chr>  <chr>               <chr>
1 1     MD-3101  Balti str-la " Botu Pavel "      3    
2 3     MD-3102  Balti str-la " Muresanu A. "     11   
3 17    MD-3102  Balti str-la " Sorocii "         46   
4 398   MD-3111  Balti str-la " Stefan cel Mare " 20   
5 1130  MD-3128  Balti str-la " Lvovului "        2    
6 1252  MD-3128  Balti str-la " Lvovului "        1    
7 2814  MD-3102  Balti str-la " Cahulului "       44 

In the second example (moldavia_2), when one pattern fails to match (here, the "str" column), all subsequent columns fail as well.

moldavia_2 <- 
  tibble(adresa = c("1;MD-3101,Balti Botu Pavel 3",
                    "3;MD-3102,Balti str-la Muresanu A. 11",
                    "17;MD-3102,Balti Sorocii 46",
                    "398;MD-3111,Balti Stefan cel Mare 20",
                    "1130;MD-3128,Balti str-la Lvovului 2",
                    "1252;MD-3128,Balti str-la Lvovului 1",
                    "2814;MD-3102,Balti Cahulului 44"))

Using separate_wider_regex():

moldavia_2 %>% 
  separate_wider_regex(cols = adresa,
                       patterns = c(ids = "^\\d+", 
                                    ";",
                                    cod_post = ".*",
                                    ",",
                                    cod_4 = "\\w+",
                                    "\\s",
                                    str = "str-la",
                                    den_str = "\\s[A-Z][a-z]+.*(?<=[a-z]|[A-Z][:punct:])\\s(?=[0-9])",
                                    nr = "\\d+$"),
                       too_few = "align_start")


# A tibble: 7 × 6
  ids   cod_post cod_4 str    den_str         nr   
  <chr> <chr>    <chr> <chr>  <chr>           <chr>
1 1     MD-3101  Balti NA      NA             NA   
2 3     MD-3102  Balti str-la " Muresanu A. " 11   
3 17    MD-3102  Balti NA      NA             NA   
4 398   MD-3111  Balti NA      NA             NA   
5 1130  MD-3128  Balti str-la " Lvovului "    2    
6 1252  MD-3128  Balti str-la " Lvovului "    1    
7 2814  MD-3102  Balti NA      NA             NA   

I expected:

# A tibble: 7 × 6
  ids   cod_post cod_4 tip_str den_str         nr   
  <chr> <chr>    <chr> <chr>   <chr>           <chr>
1 1     MD-3101  Balti NA      Botu Pavel      3    
2 3     MD-3102  Balti str-la  Muresanu A.     11   
3 17    MD-3102  Balti NA      Sorocii         46   
4 398   MD-3111  Balti NA      Stefan cel Mare 20   
5 1130  MD-3128  Balti str-la  Lvovului        2    
6 1252  MD-3128  Balti str-la  Lvovului        1    
7 2814  MD-3102  Balti NA      Cahulului       44   

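Recommended answer

Here is one way to do it: we use a non-capturing group, (?:...), and check whether the string is str-la followed by whitespace, or just whitespace (\\s). The trailing ? makes the whole group optional, so rows without str-la still match.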

library(dplyr)
library(tidyr)


moldavia_2 %>% 
  separate_wider_regex(cols = adresa,
                       patterns = c(ids = "\\d+;", 
                                    cod_post = ".*,", 
                                    cod_4 = "\\w+ ",
                                    str = "(?:str-la\\s|\\s)?",
                                    den_str = "[A-Z][a-z]+.*(?<=[a-z]|[A-Z][:punct:])\\s(?=[0-9])",
                                    nr = "\\d+$"),
                       too_few = "align_start")

#> # A tibble: 7 × 6
#>   ids   cod_post cod_4    str       den_str            nr   
#>   <chr> <chr>    <chr>    <chr>     <chr>              <chr>
#> 1 1;    MD-3101, "Balti " ""        "Botu Pavel "      3    
#> 2 3;    MD-3102, "Balti " "str-la " "Muresanu A. "     11   
#> 3 17;   MD-3102, "Balti " ""        "Sorocii "         46   
#> 4 398;  MD-3111, "Balti " ""        "Stefan cel Mare " 20   
#> 5 1130; MD-3128, "Balti " "str-la " "Lvovului "        2    
#> 6 1252; MD-3128, "Balti " "str-la " "Lvovului "        1    
#> 7 2814; MD-3102, "Balti " ""        "Cahulului "       44
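
The output above still carries the separators (";", ",", trailing spaces) inside the captured columns and uses "" rather than NA for the missing street type. A possible refinement, offered as a sketch rather than the only way to do it: keep the separators in unnamed patterns so they are matched but dropped, make the street type an optional group, and convert empty strings to NA afterwards with dplyr::na_if(). The column name tip_str follows the expected output in the question; the "[^,]+" and lazy ".*?" patterns are my own simplifications and assume that the postal code contains no comma and that the house number is the final run of digits.

```r
library(dplyr)
library(tidyr)

moldavia_2 <- tibble(adresa = c("1;MD-3101,Balti Botu Pavel 3",
                                "3;MD-3102,Balti str-la Muresanu A. 11",
                                "17;MD-3102,Balti Sorocii 46",
                                "398;MD-3111,Balti Stefan cel Mare 20",
                                "1130;MD-3128,Balti str-la Lvovului 2",
                                "1252;MD-3128,Balti str-la Lvovului 1",
                                "2814;MD-3102,Balti Cahulului 44"))

result <- moldavia_2 %>%
  separate_wider_regex(
    cols = adresa,
    patterns = c(
      ids      = "\\d+",
      ";",                        # unnamed: matched, then dropped
      cod_post = "[^,]+",         # assumption: no comma inside the postal code
      ",",
      cod_4    = "\\w+",
      "\\s",
      tip_str  = "(?:str-la)?",   # optional: captures "" when absent
      "\\s?",                     # the space after str-la, if there is one
      den_str  = ".*?",           # lazy: stops before the final number
      "\\s",
      nr       = "\\d+$"
    )
  ) %>%
  # an optional group that matched nothing comes back as "", not NA
  mutate(tip_str = na_if(tip_str, ""))

result
```

Because every row now matches the full pattern, too_few = "align_start" is no longer needed; leaving too_few at its default means a row that does not fit raises an error instead of being silently padded with NA.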

Data from the OP

moldavia_2 <- 
  tibble(adresa = c("1;MD-3101,Balti Botu Pavel 3",
                    "3;MD-3102,Balti str-la Muresanu A. 11",
                    "17;MD-3102,Balti Sorocii 46",
                    "398;MD-3111,Balti Stefan cel Mare 20",
                    "1130;MD-3128,Balti str-la Lvovului 2",
                    "1252;MD-3128,Balti str-la Lvovului 1",
                    "2814;MD-3102,Balti Cahulului 44")
         )

Created on 2023-03-22 with reprex v2.0.2
