I have an 11 GB .csv file that I ultimately need as a big.matrix
object. From what I've read, I think I need to create a filebacked big.matrix
object, but I don't know how to do that.
The file is too large for me to load directly into R and work with from there, the way I would with a smaller dataset. How can I generate a big.matrix
object from a .csv file?
See if this helps. I am posting it as an answer because it contains too much commented code for a comment.
The strategy is to read chunks of 10K rows at a time and coerce them to a sparse matrix. Then, rbind
those sub-matrices together.
It uses data.table::fread
for speed, and fpeek::peek_count_lines
, which is also fast, to count the number of lines in the data file.
library(data.table)
library(Matrix)

flname <- "your_filename"

# total lines in the file, minus one for the header row
nlines <- fpeek::peek_count_lines(flname) - 1L

chunk <- 10*1024              # rows per chunk
passes <- nlines %/% chunk    # number of full chunks
remaining <- nlines %% chunk  # rows left over after the full chunks
skip <- 1                     # start by skipping the header row

data_list <- vector("list", length = passes + (remaining > 0))

for(i in seq_len(passes)) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip, nrows = chunk)
  data_list[[i]] <- Matrix(as.matrix(tmp), sparse = TRUE)
  skip <- skip + chunk
}

# read whatever rows are left after the last full chunk
if(remaining > 0) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip)
  data_list[[passes + 1L]] <- Matrix(as.matrix(tmp), sparse = TRUE)
}

sparse_mat <- do.call(rbind, data_list)
rm(data_list)
Everything works with the test data below. I also tried it with a larger matrix.
path
is optional.
path <- "~/Temp"
flname <- file.path(path, "big_example.csv")

a <- matrix(1:(25*1024), ncol = 1)
b <- matrix(rbinom(25*1024*10, size = 1, prob = 0.01), ncol = 10)
a <- cbind(a, b)
dim(a)

write.csv(a, flname, row.names = FALSE)
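Incidentally, since the question asks for a filebacked big.matrix rather than a sparse matrix, the bigmemory package can also ingest the csv directly with read.big.matrix, which streams the file into a disk-backed matrix without ever holding it all in RAM. A minimal sketch, assuming the file has a header and only numeric columns (the csv, backing, and descriptor file names here are placeholders):

```r
library(bigmemory)

# small stand-in for the real 11 GB file (name is a placeholder)
flname <- "big_example.csv"
write.csv(matrix(rnorm(100), ncol = 10), flname, row.names = FALSE)

# read.big.matrix() streams the csv into a file-backed big.matrix,
# so the data never has to fit in memory all at once
X <- read.big.matrix(flname, sep = ",", header = TRUE,
                     type = "double",
                     backingfile = "big_example.bin",
                     descriptorfile = "big_example.desc")
dim(X)

# a later session can re-attach without re-reading the csv:
# X <- attach.big.matrix("big_example.desc")
```

The backing and descriptor files persist on disk, so subsequent sessions can attach to them instantly instead of re-parsing the csv.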