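I have two large sparse matrices (roughly 41,000 x 55,000). The density of nonzero elements is about 10%. The nonzero elements of both matrices have the same row and column indices.
I now want to modify the values in the first sparse matrix wherever the values in the second matrix are below a certain threshold.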
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
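Next, I extract the rows, columns, and values of the first matrix into a new matrix. That way I can simply subset them and keep only the entries I am interested in.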
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# Only the row indices need + 1: the @i slot is 0-based, while summary() already returns 1-based column indices.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
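As a side note on the extraction step, here is a minimal sketch of how 1-based (row, column, value) triplets could be read directly from the slots of a column-compressed sparse matrix. The toy matrix `m` and the names `rows`, `cols`, and `triplets` are made up for illustration, and the sketch assumes the matrix is a `dgCMatrix`, which is what `sparseMatrix()` returns by default for numeric values.

```r
library(Matrix)

# Small illustrative matrix (not the real data).
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3),
                  x = c(0.5, 0.01, 0.2), dims = c(3, 3))

# The @i slot stores 0-based row indices, so shift them by one.
rows <- m@i + 1L

# The @p slot stores column pointers; diff(m@p) is the number of
# stored entries in each column, which expands to 1-based column indices.
cols <- rep(seq_len(ncol(m)), diff(m@p))

# One row per stored entry, in the same order as m@x.
triplets <- data.frame(i = rows, j = cols, x = m@x)
```

On this toy matrix the result should match `summary(m)`. Whether it is noticeably cheaper than calling `summary()` on the full-size matrices has not been measured here.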
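This is where I run into trouble. Overwriting the values in the second matrix takes far too long. I let it run for several hours and it did not finish.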
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
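One possible explanation, sketched on a toy matrix: when two index vectors are passed as `m[rows, cols]`, R addresses every combination of the listed rows and columns, not the paired positions. Paired positions can instead be addressed with a two-column index matrix built by `cbind()`. The names `m`, `rows`, `cols`, `block`, and `paired` below are illustrative only, and the sketch has not been timed on matrices of the original size.

```r
library(Matrix)

# Small stand-in for the large matrices (values are illustrative).
m <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3),
                  x = c(5, 6, 7), dims = c(4, 4))
rows <- c(1, 2)
cols <- c(3, 4)

# Two index vectors select the whole block: (1,3), (1,4), (2,3), (2,4).
block <- m
block[rows, cols] <- 1

# A two-column matrix selects only the pairs: (1,3) and (2,4).
paired <- m
paired[cbind(rows, cols)] <- 1
```

With millions of selected positions, the block form asks for far more cells than intended, which would be consistent with the assignment never finishing.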
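I also considered pasting the row and column information together, so that every value gets a unique identifier. That takes a very long time as well and is probably just very bad practice.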
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
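Since both matrices are built from the same `i` and `j`, a possible shortcut is to skip identifiers entirely and work on the `@x` slots with one shared logical mask. The sketch below uses small made-up matrices `a` and `b` in place of `m1` and `m2`. It assumes both are `dgCMatrix` objects with identical `@i` and `@p` slots, which is checked explicitly before the slots are touched.

```r
library(Matrix)

# Toy data with a shared sparsity pattern (sizes are illustrative).
set.seed(1)
n <- 1000
i <- sample(1:50, n, replace = TRUE)
j <- sample(1:60, n, replace = TRUE)
a <- sparseMatrix(i = i, j = j, x = runif(n), dims = c(50, 60))
b <- sparseMatrix(i = i, j = j,
                  x = as.numeric(sample(1:3, n, replace = TRUE)),
                  dims = c(50, 60))

# The shortcut is only valid if both matrices store the same
# positions in the same order.
stopifnot(identical(a@i, b@i), identical(a@p, b@p))

# One logical mask over the stored values of the first matrix.
sel <- a@x > 0 & a@x < 0.05

# Apply the mask to both value vectors; no row or column lookup needed.
b@x[sel] <- 1
a@x[sel] <- 0

# Optionally remove the explicit zeros that were just written.
a <- drop0(a)
```

This is a plain vector operation over the stored values, so the work grows with the number of stored entries rather than with the matrix dimensions. If the two patterns ever differ, the `stopifnot()` check fails and a different approach is needed.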
Is there something fundamentally wrong with my approach? What should I do to run this efficiently?