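When you read in an object using the read_* functions with as_data_frame = FALSE, you are reading it in as an Arrow Table, which is stored in memory. Arrow is designed around performing zero-copy operations, which means that if you can manipulate the Arrow objects directly instead of pulling them into R, you should avoid creating intermediate copies of objects and blowing up your R session when working with larger objects.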
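I have a potential solution, which involves working with Arrow objects until the last moment, when you pull the data into R, though it's not the most elegant.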
# Bring in libraries
suppressMessages(library(arrow))
# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
{ "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
{ "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
{ "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
', tf, useBytes = TRUE)
# read in the JSON table as an Arrow Table
my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)
# subselect the "yo" column - this is an Arrow ChunkedArray object
# containing a Struct at position 0
yo_col <- complex_cols[["yo"]]
yo_col
#> ChunkedArray
#> <struct<param1: string, param2: string>>
#> [
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
#> ]
# extract the Struct by passing in the chunk number
sa <- yo_col$chunk(0)
sa
#> StructArray
#> <struct<param1: string, param2: string>>
#> -- is_valid: all not null
#> -- child 0 type: string
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
#> -- child 1 type: string
#> [
#> "duck2",
#> "duck4",
#> "duck6"
#> ]
# extract the "param1" column from the Struct
param1_col <- sa[["param1"]]
param1_col
#> Array
#> <string>
#> [
#> "duck1",
#> "duck3",
#> "duck5"
#> ]
# Add the param1 column to the original Table
my_tbl[["param1"]] <- param1_col
my_tbl
#> Table
#> 3 rows x 3 columns
#> $hello <double>
#> $world <bool>
#> $param1 <string>
# now pull the table into R
dplyr::collect(my_tbl)
#> # A tibble: 3 × 3
#> hello world param1
#> <dbl> <lgl> <chr>
#> 1 3.5 FALSE duck1
#> 2 3.25 NA duck3
#> 3 0 TRUE duck5
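I've been looking at how to do this directly in the tidyverse (we've modelled a lot of the Arrow package design after the tidyverse design), but many of the solutions I've seen involve running purrr::map() inside dplyr::select(), which is a workflow that isn't currently implemented in Arrow, and I don't even know if it's possible. However, if you do want to make a feature request, feel free to open a ticket on the repo.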
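Final note: in the example above, this probably won't make much difference to the memory footprint, but if you have many nested items to pull out and reassemble into a table, you may see more of a benefit.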
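If you do have several nested fields to pull out, the same steps can be wrapped in a loop. The following is only a minimal sketch that reuses the calls from the example above. The names nested_fields, struct_arr and flat_tbl are hypothetical, and it assumes the struct column arrives as a single chunk, as it does for a small file like this one; a larger file may be split into several chunks, which this sketch does not handle.

```r
suppressMessages(library(arrow))

# Same sample data as above, written one JSON record per line
tf <- tempfile()
writeLines(c(
  '{ "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }',
  '{ "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }',
  '{ "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }'
), tf)

# Keep everything as Arrow objects for now
flat_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)

# Assumption: the struct column has exactly one chunk
struct_arr <- complex_cols[["yo"]]$chunk(0)

# Hypothetical list of nested fields to lift into top-level columns
nested_fields <- c("param1", "param2")
for (field in nested_fields) {
  flat_tbl[[field]] <- struct_arr[[field]]
}

# Only now pull the data into R
result <- dplyr::collect(flat_tbl)
unlink(tf)
```

Here result should be a tibble with one column per original top-level field plus one per entry in nested_fields. Nothing is converted to R vectors until the final collect() call.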