I have an 11 GB .csv file that I ultimately need as a big.matrix
object. From what I've read, I think I need to create a filebacked big.matrix
object, but I don't know how to do that.
The file is too large for me to load directly into R and work with from there, the way I would with a smaller dataset. How can I generate a big.matrix
object from a .csv file?
See if this helps. I am posting it as an answer because it contains too much commented code for a comment.
The strategy is to read chunks of 10K rows at a time and coerce them to a sparse matrix. Then, rbind
those sub-matrices together.
It uses data.table::fread
for speed, and fpeek::peek_count_lines
, which is also fast, to count the number of lines in the data file.
library(data.table)
library(Matrix)

flname <- "your_filename"

# total lines in the file, minus one for the header row
nlines <- fpeek::peek_count_lines(flname) - 1L

chunk <- 10*1024              # rows per chunk
passes <- nlines %/% chunk    # number of full chunks
remaining <- nlines %% chunk  # rows left over after the full chunks
skip <- 1                     # start by skipping the header row

data_list <- vector("list", length = passes + (remaining > 0))

for(i in seq_len(passes)) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip, nrows = chunk)
  data_list[[i]] <- Matrix(as.matrix(tmp), sparse = TRUE)
  skip <- skip + chunk
}

# read whatever rows are left after the last full chunk
if(remaining > 0) {
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip)
  data_list[[passes + 1L]] <- Matrix(as.matrix(tmp), sparse = TRUE)
}

sparse_mat <- do.call(rbind, data_list)
rm(data_list)
Everything works with the test data below. I also tried it with a larger matrix.
path
is optional.
path <- "~/Temp"
flname <- file.path(path, "big_example.csv")

a <- matrix(1:(25*1024), ncol = 1)
b <- matrix(rbinom(25*1024*10, size = 1, prob = 0.01), ncol = 10)
a <- cbind(a, b)
dim(a)

write.csv(a, flname, row.names = FALSE)
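Incidentally, since the question asks for a filebacked big.matrix rather than a sparse matrix, the bigmemory package can also ingest the csv directly with read.big.matrix, which streams the file into a disk-backed matrix without ever holding it all in RAM. A minimal sketch, assuming the file has a header and only numeric columns (the csv, backing, and descriptor file names here are placeholders):

```r
library(bigmemory)

# small stand-in for the real 11 GB file (name is a placeholder)
flname <- "big_example.csv"
write.csv(matrix(rnorm(100), ncol = 10), flname, row.names = FALSE)

# read.big.matrix() streams the csv into a file-backed big.matrix,
# so the data never has to fit in memory all at once
X <- read.big.matrix(flname, sep = ",", header = TRUE,
                     type = "double",
                     backingfile = "big_example.bin",
                     descriptorfile = "big_example.desc")
dim(X)

# a later session can re-attach without re-reading the csv:
# X <- attach.big.matrix("big_example.desc")
```

The backing and descriptor files persist on disk, so subsequent sessions can attach to them instantly instead of re-parsing the csv.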