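I am working with Eurostat data in R. One of its variables is Geopolitical entity (reporting), which typically takes values such as "Euro Area - 12 countries (2001-2006)" or "European Union - 27 countries (from 2020)". I want to abbreviate every value of Geopolitical entity (reporting) that starts with "Euro" so that I am left with values such as EA12 or EU27, i.e. the first letter of each of the first two words followed by the number of countries. I know I need to use mutate and case_when together with gsub or sub, but I have never been good with regular expressions. For context: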
library(tidyverse)
library(eurostat)
df <- get_eurostat("spr_exp_pens",type = "label")
colnames(df)[1:5] <- label_eurostat_vars(df)
I also know that I need to continue along these lines:
df %>% mutate(`Geopolitical entity (reporting)` =
case_when( ...
Can anyone help me?
Edit: I realized that Geopolitical entity (reporting) also takes the values Germany (until 1990 former territory of the FRG) and European Economic Area (EEA18-1995, EEA28-2004, EEA30-2007, EEA31-2013, EEA30-2020), and I would like to shorten these to Germany and EEA as well.
I have made some progress:
df <- df %>% mutate(`Geopolitical entity (reporting)` =
  case_when(
    grepl("Germany", `Geopolitical entity (reporting)`, ignore.case = TRUE) ~ "Germany",
    startsWith(`Geopolitical entity (reporting)`, "Euro") ~
      sub("^([A-Za-z])[A-Za-z]*\\s([A-Za-z])[A-Za-z]*\\s-\\s(\\d+).*", "\\U\\1\\U\\2\\3", `Geopolitical entity (reporting)`, perl = TRUE),
    startsWith(`Geopolitical entity (reporting)`, "European Economic Area") ~ "EEA",
    TRUE ~ `Geopolitical entity (reporting)`))
However, two values still remain unchanged:
- European Economic Area (EEA18-1995, EEA28-2004, EEA30-2007, EEA31-2013, EEA30-2020)
- Euro area – 20 countries (from 2023)
Can anyone explain why this happens and how to handle these last two cases?
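For what it is worth, here is a likely explanation and a minimal sketch of one possible fix. It rests on two assumptions about the data as it appears in this post. First, case_when uses the first condition that is TRUE, and "European Economic Area ..." also starts with "Euro", so it lands in the sub() branch, where the pattern does not match and the string is returned unchanged; the "EEA" branch is never reached. Second, the separator in "Euro area – 20 countries (from 2023)" appears to be an en dash (U+2013) rather than an ASCII hyphen, so `\\s-\\s` does not match it. The helper name shorten_geo below is hypothetical and uses base R only, so it can be tried without downloading anything.

```r
# Hypothetical helper (not from the original post): shortens the labels
# discussed above and leaves every other label untouched.
shorten_geo <- function(x) {
  # first word, second word, then a hyphen OR an en dash (U+2013),
  # then the number of countries
  agg_pattern <- paste0(
    "^([A-Za-z])[A-Za-z]*\\s+([A-Za-z])[A-Za-z]*",
    "\\s+(?:-|\u2013)\\s+(\\d+).*$"
  )

  out <- x
  is_agg     <- grepl(agg_pattern, x, perl = TRUE)
  is_eea     <- startsWith(x, "European Economic Area")
  is_germany <- grepl("Germany", x, ignore.case = TRUE)

  # \\U upper-cases the two captured initials, \\E ends the case conversion
  out[is_agg]     <- sub(agg_pattern, "\\U\\1\\2\\E\\3", x[is_agg], perl = TRUE)
  out[is_eea]     <- "EEA"
  out[is_germany] <- "Germany"
  out
}

# Sample labels taken from the question; "\u2013" is the en dash.
labels <- c(
  "Euro Area - 12 countries (2001-2006)",
  "European Union - 27 countries (from 2020)",
  "Euro area \u2013 20 countries (from 2023)",
  "European Economic Area (EEA18-1995, EEA28-2004, EEA30-2007, EEA31-2013, EEA30-2020)",
  "Germany (until 1990 former territory of the FRG)",
  "France"
)
print(shorten_geo(labels))
```

Inside the pipeline this could be used as df %>% mutate(`Geopolitical entity (reporting)` = shorten_geo(`Geopolitical entity (reporting)`)). If you would rather keep case_when, moving the "European Economic Area" condition above the startsWith(..., "Euro") condition and allowing the en dash in the pattern should have the same effect.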