Are they doing the same thing? No. A few reasons:
In your SQL code you have partition by id, and the analog in dplyr is group_by(id), which you have omitted.
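Your data contains integers, but by using base::ifelse(., ., "END") you are silently (and destructively) converting them to character. Allowing R to silently coerce to a different class is bad practice, especially when something downstream expects integers. If you want all fields to be strings, I suggest you explicitly convert them to strings first.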
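To make the silent coercion concrete, here is a minimal base R sketch. The vector year_end is made up for illustration; it mimics an integer column after lead(), where the last value of a group is missing.

```r
# Hypothetical integer column, as lead() would leave it (last value missing).
year_end <- c(2011L, 2012L, NA)
class(year_end)

# ifelse() raises no error or warning, yet the result is character.
silently <- ifelse(is.na(year_end), "END", year_end)
class(silently)

# Converting explicitly makes the change of class visible in the code.
explicitly <- as.character(year_end)
explicitly[is.na(explicitly)] <- "END"
identical(silently, explicitly)
```

Both routes end with the same strings; the difference is that the second one states the conversion instead of leaving it to coercion rules.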
To that end, my code below preserves the integer class until the very end, where you can optionally change the columns to strings with a final mutate(..).
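Unless you know that you strictly need the value to be the string "END", I suggest you are better off with NA or some other number-like sentinel value such as Inf. This depends heavily on how you use it.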
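A small sketch of how the three choices behave downstream, again with made-up vectors. Note that Inf is a double, so choosing it as the sentinel promotes an integer column to numeric.

```r
# Hypothetical start/end years for one id; the last row has no successor.
year_start  <- c(2010L, 2011L, 2012L)
year_end_na <- c(2011L, 2012L, NA)

# With NA the column stays integer and arithmetic propagates the missing value.
class(year_end_na)
year_end_na - year_start

# Inf is a double, so the column becomes numeric, but arithmetic still works.
year_end_inf <- c(2011, 2012, Inf)
class(year_end_inf)
year_end_inf - year_start

# A character sentinel blocks arithmetic altogether; the error is caught here
# only so that the script keeps running.
year_end_chr <- c("2011", "2012", "END")
res <- tryCatch(year_end_chr - year_start, error = function(e) conditionMessage(e))
res
```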
In the R code you replace null values (NA) with "END", and you do not attempt this in the SQL. After you CAST the data to varchar (or char or ...), the correct SQL verb is COALESCE.
FYI, mutate_at has been superseded for a while now; I suggest you switch to across.
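For comparison, the two spellings side by side. This is only a sketch: the toy tibble and its column names are invented for illustration and are not the data from the question.

```r
library(dplyr)

# Hypothetical toy data, only to compare the two spellings.
toy <- tibble(id = c(1L, 1L, 2L), a = 1:3, b = c("x", "y", "z"))

# Superseded style: scoped verb with vars() and a named list of functions.
old_style <- toy %>%
  group_by(id) %>%
  mutate_at(vars(a, b), list(end = ~ lead(.))) %>%
  ungroup()

# Current style: across() inside mutate(), with .names controlling the suffix.
new_style <- toy %>%
  group_by(id) %>%
  mutate(across(c(a, b), ~ lead(.x), .names = "{.col}_end")) %>%
  ungroup()

names(old_style)
names(new_style)
identical(old_style$a_end, new_style$a_end)
```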
Here are the updated code blocks.
dplyr
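library(dplyr)

start %>%
  group_by(id) %>%
  # lead() takes the next row within each id; new columns get an "_end" suffix
  mutate(across(everything(), ~ lead(.), .names = "{.col}_end")) %>%
  ungroup() %>%
  # everything that is neither id nor an "_end" column gets a "_start" suffix
  rename_with(.fn = ~ paste0(., "_start"), .cols = -c(id, ends_with("_end")))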
# # A tibble: 7 × 9
#      id year_start col2_start col3_start col4_start year_end col2_end col3_end col4_end
#   <int>      <int> <chr>           <int>      <int>    <int> <chr>       <int>    <int>
# 1   111       2010 A                 242       1213     2011 BB            213     5959
# 2   111       2011 BB                213       5959     2012 A             233     9988
# 3   111       2012 A                 233       9988     2013 C             455     4242
# 4   111       2013 C                 455       4242       NA <NA>           NA       NA
# 5   222       2018 D                  11        333     2019 EE            444     1232
# 6   222       2019 EE                444       1232     2020 F             123       98
# 7   222       2020 F                 123         98       NA <NA>           NA       NA
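One difference remains between the two versions: the SQL orders each partition explicitly (ORDER BY year), while lead() simply takes the next row within the group. The sketch below rebuilds start from the printed output above (an assumption; the real object is in the question) and sorts with arrange() before grouping, so the result does not depend on the incoming row order. The helper pair_rows is a hypothetical name used only for this illustration.

```r
library(dplyr)

# start as reconstructed from the printed output; an assumption, not the
# asker's original object.
start <- tibble(
  id   = c(111L, 111L, 111L, 111L, 222L, 222L, 222L),
  year = c(2010L, 2011L, 2012L, 2013L, 2018L, 2019L, 2020L),
  col2 = c("A", "BB", "A", "C", "D", "EE", "F"),
  col3 = c(242L, 213L, 233L, 455L, 11L, 444L, 123L),
  col4 = c(1213L, 5959L, 9988L, 4242L, 333L, 1232L, 98L)
)

# Hypothetical helper wrapping the pipeline, with an explicit sort added.
pair_rows <- function(x) {
  x %>%
    arrange(id, year) %>%   # plays the role of ORDER BY year in the SQL
    group_by(id) %>%
    mutate(across(everything(), ~ lead(.x), .names = "{.col}_end")) %>%
    ungroup() %>%
    rename_with(.fn = ~ paste0(.x, "_start"), .cols = -c(id, ends_with("_end")))
}

sorted_input   <- pair_rows(start)
reversed_input <- pair_rows(start[rev(seq_len(nrow(start))), ])

# Because of arrange(), reversing the input rows should not change year_end.
identical(sorted_input$year_end, reversed_input$year_end)
```

If the data is already sorted by year within id, as it appears to be here, the arrange() step changes nothing and can be dropped.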
If you really need "END" to mark the end of a group (instead of the NA, which seems unambiguous here), then add:
... %>%
  mutate(across(matches("_(start|end)$"), ~ coalesce(as.character(.), "END")))
# # A tibble: 7 × 9
#      id year_start col2_start col3_start col4_start year_end col2_end col3_end col4_end
#   <int> <chr>      <chr>      <chr>      <chr>      <chr>    <chr>    <chr>    <chr>
# 1   111 2010       A          242        1213       2011     BB       213      5959
# 2   111 2011       BB         213        5959       2012     A        233      9988
# 3   111 2012       A          233        9988       2013     C        455      4242
# 4   111 2013       C          455        4242       END      END      END      END
# 5   222 2018       D          11         333        2019     EE       444      1232
# 6   222 2019       EE         444        1232       2020     F        123      98
# 7   222 2020       F          123        98         END      END      END      END
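The as.character() inside coalesce() is what makes the conversion explicit. As a rough sketch with a made-up vector: dplyr's coalesce() refuses to combine an integer vector with a character default, so the class change cannot happen unnoticed.

```r
library(dplyr)

year_end <- c(2011L, NA)   # hypothetical integer column

# Mixing an integer vector with a character default is refused; the error is
# caught here only to keep the script running.
mixed <- tryCatch(coalesce(year_end, "END"), error = function(e) "error")
mixed

# Spelling out the conversion works, as in the pipeline above.
converted <- coalesce(as.character(year_end), "END")
converted
```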
SQL
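Your initial code produces the same result as the first code block above (preserving the integer classes). I will add tibble at the end merely to get its column-class hints, not to add anything else to this discussion: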
sqldf::sqldf("
  SELECT
    id,
    year AS year_start,
    LEAD(year) OVER (PARTITION BY id ORDER BY year) AS year_end,
    col2 AS col2_start,
    LEAD(col2) OVER (PARTITION BY id ORDER BY year) AS col2_end,
    col3 AS col3_start,
    LEAD(col3) OVER (PARTITION BY id ORDER BY year) AS col3_end,
    col4 AS col4_start,
    LEAD(col4) OVER (PARTITION BY id ORDER BY year) AS col4_end
  FROM start;") %>%
  tibble()
# # A tibble: 7 × 9
#      id year_start year_end col2_start col2_end col3_start col3_end col4_start col4_end
#   <int>      <int>    <int> <chr>      <chr>         <int>    <int>      <int>    <int>
# 1   111       2010     2011 A          BB              242      213       1213     5959
# 2   111       2011     2012 BB         A               213      233       5959     9988
# 3   111       2012     2013 A          C               233      455       9988     4242
# 4   111       2013       NA C          <NA>            455       NA       4242       NA
# 5   222       2018     2019 D          EE               11      444        333     1232
# 6   222       2019     2020 EE         F               444      123       1232       98
# 7   222       2020       NA F          <NA>            123       NA         98       NA
If you need the SQL to also convert the numbers to strings and use "END", then
sqldf::sqldf("
  SELECT
    id,
    CAST(year as varchar(8)) AS year_start,
    COALESCE(CAST(LEAD(year) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS year_end,
    CAST(col2 as varchar(8)) AS col2_start,
    COALESCE(CAST(LEAD(col2) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col2_end,
    CAST(col3 as varchar(8)) AS col3_start,
    COALESCE(CAST(LEAD(col3) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col3_end,
    CAST(col4 as varchar(8)) AS col4_start,
    COALESCE(CAST(LEAD(col4) OVER (PARTITION BY id ORDER BY year) as VARCHAR(9)), 'END') AS col4_end
  FROM start;") %>%
  tibble()
# # A tibble: 7 × 9
#      id year_start year_end col2_start col2_end col3_start col3_end col4_start col4_end
#   <int> <chr>      <chr>    <chr>      <chr>    <chr>      <chr>    <chr>      <chr>
# 1   111 2010       2011     A          BB       242        213      1213       5959
# 2   111 2011       2012     BB         A        213        233      5959       9988
# 3   111 2012       2013     A          C        233        455      9988       4242
# 4   111 2013       END      C          END      455        END      4242       END
# 5   222 2018       2019     D          EE       11         444      333        1232
# 6   222 2019       2020     EE         F        444        123      1232       98
# 7   222 2020       END      F          END      123        END      98         END
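If you have not inferred it by now, I am a stickler for consistency in a variable's class: I have made too many assumptions about the class of an object and been bitten when R silently coerced it to a string or similar. ifelse is a common culprit for this; compare ifelse(c(T,T), 1, "A") with ifelse(c(T,F), 1, "A"). It is not class-safe in other ways as well.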
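The comparison can be run directly. The sketch below also shows one of the "other ways" (ifelse drops attributes such as the Date class) and, as a stricter alternative that the answer itself does not name, dplyr::if_else(), which stops with an error instead of guessing. The date is arbitrary.

```r
library(dplyr)

# The class of the result depends on the values of the test, not on the inputs.
class(ifelse(c(TRUE, TRUE), 1, "A"))
class(ifelse(c(TRUE, FALSE), 1, "A"))

# ifelse() also drops attributes: a Date comes back as a bare number.
some_day <- as.Date("2024-01-01")   # arbitrary date for illustration
class(ifelse(TRUE, some_day, NA))

# dplyr::if_else() refuses to mix types; the error is caught here only to keep
# the script running.
strict <- tryCatch(if_else(c(TRUE, FALSE), 1, "A"), error = function(e) "error")
strict
```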