The fastest way to work with big data in R

Posted on September 19

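I have 40+ CSV files, each around 400 MB. I need to read these 40+ large CSV files, do some manipulation and formatting on them (such as a common date format, splitting dates into month, day, etc.), and combine them into a single data frame. I searched previous posts for the fastest way to read these CSV files, which is `fread`, but even with `fread` it takes about 15 minutes. Reading each file takes about 14 seconds, which leaves me with a fairly long run time. I tried using SQLite through RSQLite for a single CSV file: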

library(DBI)
library(RSQLite)
setwd("raw_data/sqldatabase")
db <- dbConnect(SQLite(), dbname = "test_db.sqlite") ## will make, if not present
dbWriteTable(conn = db, name = "your_table", value = "testdata.csv", row.names = FALSE, header = TRUE)

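However, even with SQLite it takes quite a long time. What can I use to quickly read 40+ large CSV files into a "space" where manipulating them is very fast?

If I have to upload the data to a database once, and manipulation becomes very fast afterwards, I'm fine with that, but the final file (once everything is merged) is expected to be 25+ GB. So I'm trying to find the most efficient way to work with the data.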

library(data.table) # for fwrite() here and fread() later
set.seed(42)
ind <- sample(1:5, size = nrow(nycflights13::flights), replace = TRUE)
head(ind)
dir.create("csv")
i <- 1L
for (dat in split(nycflights13::flights, ind)) {
  fwrite(dat, sprintf("csv/%s.csv", i))
  i <- i + 1L
}
file.info(Sys.glob("csv/*"))
#              size isdir mode               mtime               ctime               atime exe
# csv/1.csv 6274623 FALSE  666 2022-09-19 05:21:40 2022-09-19 05:21:40 2022-09-19 05:21:54  no
# csv/2.csv 6265804 FALSE  666 2022-09-19 05:21:40 2022-09-19 05:21:40 2022-09-19 05:21:52  no
# csv/3.csv 6261533 FALSE  666 2022-09-19 05:21:40 2022-09-19 05:21:40 2022-09-19 05:21:43  no
# csv/4.csv 6260298 FALSE  666 2022-09-19 05:21:40 2022-09-19 05:21:40 2022-09-19 05:21:49  no
# csv/5.csv 6235815 FALSE  666 2022-09-19 05:21:40 2022-09-19 05:21:40 2022-09-19 05:21:46  no

Plan 1: no indexes

dir.create("datamart")
for (fn in Sys.glob("csv/*.csv")) {
  X <- fread(fn)
  arrow::write_parquet(X, file.path("datamart", paste0(basename(fn), ".parquet")))
  rm(X)
  gc() # optional, might help
}
file.info(Sys.glob("datamart/*"))
#                           size isdir mode               mtime               ctime               atime exe
# datamart/1.csv.parquet 1251629 FALSE  666 2022-09-19 05:26:28 2022-09-19 05:26:28 2022-09-19 05:35:59  no
# datamart/2.csv.parquet 1249485 FALSE  666 2022-09-19 05:26:45 2022-09-19 05:26:45 2022-09-19 05:35:59  no
# datamart/3.csv.parquet 1249652 FALSE  666 2022-09-19 05:26:47 2022-09-19 05:26:47 2022-09-19 05:35:59  no
# datamart/4.csv.parquet 1249772 FALSE  666 2022-09-19 05:26:48 2022-09-19 05:26:48 2022-09-19 05:35:59  no
# datamart/5.csv.parquet 1245022 FALSE  666 2022-09-19 05:26:49 2022-09-19 05:26:49 2022-09-19 05:35:59  no

Reading in the data:

library(dplyr)
library(arrow)
ds <- open_dataset("datamart")
nrow(ds)
# [1] 336776
object.size(ds) # environment
# 504 bytes
# note: ls.objects() is not base R; it is a user-defined helper that lists object sizes
with(ls.objects(envir = ds), sum(Size))
# [1] 145888
ds %>%
  filter(month == 1, between(day, 1, 10))
# FileSystemDataset (query)
# year: int32
# month: int32
# day: int32
# dep_time: int32
# sched_dep_time: int32
# dep_delay: int32
# arr_time: int32
# sched_arr_time: int32
# arr_delay: int32
# carrier: string
# flight: int32
# tailnum: string
# origin: string
# dest: string
# air_time: int32
# distance: int32
# hour: int32
# minute: int32
# time_hour: timestamp[us, tz=UTC]
#
# * Filter: ((month == 1) and ((day >= 1) and (day <= 10)))
# See $.data for the source Arrow object

ds %>%
  filter(month == 1, between(day, 1, 10)) %>%
  collect()
#        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin   dest
#       <int> <int> <int>    <int>          <int>     <int>    <int>          <int>     <int>  <char>  <int>  <char> <char> <char>
#    1:  2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA    ATL
#    2:  2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR    FLL
#    3:  2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA    IAD
#    4:  2013     1     1      558            600        -2      923            937       -14      UA   1124  N53441    EWR    SFO
#    5:  2013     1     1      559            600        -1      941            910        31      AA    707  N3DUAA    LGA    DFW
#    6:  2013     1     1      607            607         0      858            915       -17      UA   1077  N53442    EWR    MIA
#    7:  2013     1     1      613            610         3      925            921         4      B6    135  N635JB    JFK    RSW
#    8:  2013     1     1      615            615         0      833            842        -9      DL    575  N326NB    EWR    ATL
#    9:  2013     1     1      623            610        13      920            915         5      AA   1837  N3EMAA    LGA    MIA
#   10:  2013     1     1      624            630        -6      840            830        10      MQ   4599  N518MQ    LGA    MSP
#   ---
# 8823:  2013     1    10     2038           2045        -7     2140           2154       -14      B6   1178  N640JB    EWR    BOS
# 8824:  2013     1    10     2040           2040         0     2351           2357        -6      B6    677  N809JB    JFK    LAX
# 8825:  2013     1    10     2054           2100        -6     2202           2207        -5      US   2144  N952UW    LGA    BOS
# 8826:  2013     1    10     2058           2100        -2     2229           2225         4      WN    530  N443WN    LGA    MDW
# 8827:  2013     1    10     2104           2110        -6     2337           2355       -18      B6    529  N507JB    EWR    MCO
# 8828:  2013     1    10     2129           2130        -1      148            218       -30      B6    701  N193JB    JFK    SJU
# 8829:  2013     1    10     2159           2159         0     2247           2300       -13      EV   4519  N13124    EWR    BWI
# 8830:  2013     1    10     2320           2250        30       16           2354        22      B6   1018  N612JB    JFK    BOS
# 8831:  2013     1    10       NA            635        NA       NA            940        NA      AA    711  N3CDAA    LGA    DFW
# 8832:  2013     1    10       NA            700        NA       NA           1007        NA      UA    719            EWR    DFW
# 5 variables not shown: [air_time <int>, distance <int>, hour <int>, minute <int>, time_hour <POSc>]

Plan 2: using year and month as nested subdirectories

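In your data, an indexable field might be:

the original CSV filename, if you want to be able to filter on it

one or more innate fields that are categorical/ordinal (not continuous)

There's a balance between convenience and usability: if a categorical variable has 20,000 possible values, that may be too many, and you will lose much of the efficiency. The more directories/files found within the subdirectories, the longer the call to open_dataset takes before you can do anything with the data. I don't have easy metrics for this.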
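One rough way to vet a candidate index field before writing anything is to count how many subdirectories it would create and how many rows each would hold. The following is only a sketch on made-up data: `partition_report` is a hypothetical helper name, and where "too many" begins is something you would have to judge for your own files.

```r
# Hypothetical helper: summarise how each column would split the data
# if it were used as a partition (subdirectory) key.
partition_report <- function(dat, cols) {
  do.call(rbind, lapply(cols, function(col) {
    counts <- as.vector(table(dat[[col]], useNA = "ifany"))
    data.frame(
      column       = col,
      n_partitions = length(counts),  # number of subdirectories created
      min_rows     = min(counts),     # rows in the smallest partition
      median_rows  = median(counts),  # rows in a typical partition
      stringsAsFactors = FALSE
    )
  }))
}

# Made-up sample: 'month' has few levels, 'id' is unique per row.
set.seed(1)
toy <- data.frame(
  id    = seq_len(1000),
  month = sample(1:12, 1000, replace = TRUE),
  value = rnorm(1000)
)
report <- partition_report(toy, c("month", "id"))
print(report)
```

A field like `month` yields a handful of reasonably sized partitions, while a field like `id` yields one tiny partition per row, which is the situation described above as losing most of the efficiency.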

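Note: you might be able to use write_dataset, the counterpart of the open_dataset used above. It handles partitioning= in the same way. However, unless you can be certain that each level of the partition field is unique to one file (for example, in my sample data every CSV file contains rows with month == 1), each CSV file you process would overwrite some of the data written for the previous one. For that reason, as I'll demonstrate here, I write to the subdirectories manually.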

# ensures .parquet files are additive
addfile <- function(dat, base, by) {
  thisdir <- do.call(file.path, as.list(c(base, paste(names(by), unname(by), sep = "="))))
  dir.create(thisdir, recursive = TRUE, showWarnings = FALSE)
  existing <- list.files(thisdir)
  thisfile <- sprintf("%i.parquet", length(existing) + 1)
  arrow::write_parquet(dat, file.path(thisdir, thisfile))
}
dir.create("datamart2")
for (fn in Sys.glob("csv/*.csv")) {
  X <- fread(fn)
  X[, addfile(.SD, "datamart2", by = .BY), by = .(year, month)]
  rm(X)
  gc() # optional
}
file.info(Sys.glob("datamart2/*/*/*"))
#                                          size isdir mode               mtime               ctime               atime exe
# datamart2/year=2013/month=1/1.parquet  133469 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=1/2.parquet  132760 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=1/3.parquet  134069 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=1/4.parquet  132404 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=1/5.parquet  136424 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=10/1.parquet 140490 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=10/2.parquet 139362 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=10/3.parquet 138570 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=10/4.parquet 137501 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=10/5.parquet 137426 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=11/1.parquet 133714 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=11/2.parquet 134291 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=11/3.parquet 133199 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=11/4.parquet 136152 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=11/5.parquet 133310 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=12/1.parquet 141743 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=12/2.parquet 142030 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=12/3.parquet 139573 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=12/4.parquet 140515 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=12/5.parquet 140059 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=2/1.parquet  126203 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=2/2.parquet  126481 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=2/3.parquet  126348 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=2/4.parquet  126618 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=2/5.parquet  123947 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=3/1.parquet  140691 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=3/2.parquet  142811 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=3/3.parquet  142415 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=3/4.parquet  140573 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=3/5.parquet  138510 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=4/1.parquet  140734 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=4/2.parquet  140707 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=4/3.parquet  140507 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=4/4.parquet  141896 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=4/5.parquet  141182 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=5/1.parquet  139517 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=5/2.parquet  140546 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=5/3.parquet  143193 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=5/4.parquet  139979 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=5/5.parquet  141259 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=6/1.parquet  143405 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=6/2.parquet  142591 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=6/3.parquet  142106 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=6/4.parquet  143012 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=6/5.parquet  141489 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=7/1.parquet  145064 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=7/2.parquet  143898 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=7/3.parquet  144104 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=7/4.parquet  146099 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=7/5.parquet  146616 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=8/1.parquet  145155 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=8/2.parquet  143314 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=8/3.parquet  145334 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=8/4.parquet  144581 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=8/5.parquet  145998 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no
# datamart2/year=2013/month=9/1.parquet  135902 FALSE  666 2022-09-19 05:53:25 2022-09-19 05:53:25 2022-09-19 05:53:25  no
# datamart2/year=2013/month=9/2.parquet  135525 FALSE  666 2022-09-19 05:53:26 2022-09-19 05:53:26 2022-09-19 05:53:26  no
# datamart2/year=2013/month=9/3.parquet  136012 FALSE  666 2022-09-19 05:53:28 2022-09-19 05:53:28 2022-09-19 05:53:28  no
# datamart2/year=2013/month=9/4.parquet  137506 FALSE  666 2022-09-19 05:53:29 2022-09-19 05:53:29 2022-09-19 05:53:29  no
# datamart2/year=2013/month=9/5.parquet  133894 FALSE  666 2022-09-19 05:53:30 2022-09-19 05:53:30 2022-09-19 05:53:30  no

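Reading the data works the same way as in part 1, though note that the index fields year and month are now the last two columns instead of the first two: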

library(dplyr)
library(arrow)
ds <- open_dataset("datamart2")
nrow(ds)
# [1] 336776
object.size(ds) # environment
# 504 bytes
with(ls.objects(envir = ds), sum(Size))
# [1] 155896
ds %>%
  filter(month == 1, between(day, 1, 10))
# FileSystemDataset (query)
# day: int32
# dep_time: int32
# sched_dep_time: int32
# dep_delay: int32
# arr_time: int32
# sched_arr_time: int32
# arr_delay: int32
# carrier: string
# flight: int32
# tailnum: string
# origin: string
# dest: string
# air_time: int32
# distance: int32
# hour: int32
# minute: int32
# time_hour: timestamp[us, tz=UTC]
# year: int32
# month: int32
#
# * Filter: ((month == 1) and ((day >= 1) and (day <= 10)))
# See $.data for the source Arrow object
ds %>%
  filter(month == 1, between(day, 1, 10)) %>%
  collect()
#         day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin   dest air_time
#       <int>    <int>          <int>     <int>    <int>          <int>     <int>  <char>  <int>  <char> <char> <char>    <int>
#    1:     1      517            515         2      830            819        11      UA   1545  N14228    EWR    IAH      227
#    2:     1      542            540         2      923            850        33      AA   1141  N619AA    JFK    MIA      160
#    3:     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK    BQN      183
#    4:     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK    MCO      140
#    5:     1      558            600        -2      849            851        -2      B6     49  N793JB    JFK    PBI      149
#    6:     1      559            600        -1      854            902        -8      UA   1187  N76515    EWR    LAS      337
#    7:     1      600            600         0      851            858        -7      B6    371  N595JB    LGA    FLL      152
#    8:     1      615            615         0     1039           1100       -21      B6    709  N794JB    JFK    SJU      182
#    9:     1      635            635         0     1028            940        48      AA    711  N3GKAA    LGA    DFW      248
#   10:     1      655            655         0     1021           1030        -9      DL   1415  N3763D    JFK    SLC      294
#   ---
# 8823:    10     2038           2045        -7     2140           2154       -14      B6   1178  N640JB    EWR    BOS       40
# 8824:    10     2040           2040         0     2351           2357        -6      B6    677  N809JB    JFK    LAX      343
# 8825:    10     2054           2100        -6     2202           2207        -5      US   2144  N952UW    LGA    BOS       34
# 8826:    10     2058           2100        -2     2229           2225         4      WN    530  N443WN    LGA    MDW      117
# 8827:    10     2104           2110        -6     2337           2355       -18      B6    529  N507JB    EWR    MCO      127
# 8828:    10     2129           2130        -1      148            218       -30      B6    701  N193JB    JFK    SJU      186
# 8829:    10     2159           2159         0     2247           2300       -13      EV   4519  N13124    EWR    BWI       33
# 8830:    10     2320           2250        30       16           2354        22      B6   1018  N612JB    JFK    BOS       35
# 8831:    10       NA            635        NA       NA            940        NA      AA    711  N3CDAA    LGA    DFW       NA
# 8832:    10       NA            700        NA       NA           1007        NA      UA    719            EWR    DFW       NA
# 6 variables not shown: [distance <int>, hour <int>, minute <int>, time_hour <POSc>, year <int>, month <int>]

Notes:

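If your system can handle it, it may be worth the effort to iterate over each year=?/month=? subdirectory and combine all of its data into a single file. Remember my mention of "many files, slow open_dataset"? Combining sibling .parquet files might help with that. Again, this is only possible if you can read the entire contents of one subdirectory into memory at a time. It may not be necessary; that's your call.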
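A minimal sketch of that compaction step, assuming each leaf directory fits in memory. `compact_partition` and the output name `combined.parquet` are hypothetical names chosen for illustration; only `arrow::read_parquet` and `arrow::write_parquet` come from the arrow package.

```r
# Hypothetical helper: merge all .parquet files in one leaf directory
# into a single file. Assumes the directory's contents fit in memory.
compact_partition <- function(dir, out = "combined.parquet") {
  files <- list.files(dir, pattern = "\\.parquet$", full.names = TRUE)
  if (length(files) < 2L) return(invisible(files)) # nothing to combine
  combined <- do.call(rbind, lapply(files, function(f) {
    as.data.frame(arrow::read_parquet(f))
  }))
  # Write under a temporary name first so the originals are only
  # removed once the combined file exists on disk.
  tmp <- file.path(dir, paste0(out, ".tmp"))
  arrow::write_parquet(combined, tmp)
  file.remove(files)
  file.rename(tmp, file.path(dir, out))
  invisible(file.path(dir, out))
}

# Apply to every year=?/month=? leaf directory (a no-op if none exist).
leaf_dirs <- unique(dirname(Sys.glob("datamart2/*/*/*.parquet")))
for (d in leaf_dirs) compact_partition(d)
```

Because the partition values live in the directory names rather than in the files, the combined file keeps the same columns as its inputs and open_dataset("datamart2") should read it exactly as before.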

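I used a "hive" partitioning/subdirectory scheme, where a directory name such as year=2013 implicitly creates the indexable column. (If you read one of the 1.parquet files directly, you'll notice that it contains neither year nor month.) If that is not what you want, you can create plainer path names (e.g., 2013/12/1.parquet) and use open_dataset("datamart2", partitioning = c("year", "month")).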
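To make the difference between the two layouts concrete, here is a small base-R sketch of how the directory for one group could be built either way. `partition_dir` is a hypothetical helper, not part of arrow; it mirrors the path construction inside addfile above.

```r
# Hypothetical helper: build a partition directory from a named list of keys.
# hive = TRUE  gives "base/year=2013/month=12" (column names embedded in the path)
# hive = FALSE gives "base/2013/12"            (values only)
partition_dir <- function(base, by, hive = TRUE) {
  vals <- unlist(by)
  parts <- if (hive) paste(names(by), vals, sep = "=") else as.character(vals)
  do.call(file.path, as.list(c(base, parts)))
}

by <- list(year = 2013L, month = 12L)
hive_path  <- partition_dir("datamart2", by)
plain_path <- partition_dir("datamart2", by, hive = FALSE)
print(hive_path)
# [1] "datamart2/year=2013/month=12"
print(plain_path)
# [1] "datamart2/2013/12"
```

With the plain layout the column names are no longer recoverable from the paths, which is why they have to be supplied to open_dataset explicitly.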
