You will probably get the best performance from a carefully designed for loop:
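uniquemat <- function(x) {
  # Map every value to an integer id so a row can index a lookup vector
  y <- array(match(c(x), u <- unique(c(x))), dim(x))
  u <- logical(length(u))   # u[id] is TRUE once that value has been used
  k <- logical(nrow(x))     # k[i] is TRUE if row i is kept
  u[y[1,]] <- k[1] <- TRUE  # the first row is always kept
  # Keep row i only if none of its values occurs in a previously kept row;
  # seq_len(...)[-1] is empty for a one-row matrix, unlike 2:nrow(x)
  for (i in seq_len(nrow(x))[-1]) if (!any(u[y[i,]])) u[y[i,]] <- k[i] <- TRUE
  x[k, , drop = FALSE]      # stay a matrix even if only one row survives
}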
uniquemat(T_data)
#>      [,1] [,2]
#> [1,]    7    9
#> [2,]    8   10
#> [3,]    2    1
#> [4,]    5    4
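To make the two steps inside uniquemat easier to follow, here is a minimal, self-contained walkthrough. The matrix m is a hypothetical input made up for illustration (it is not the T_data from the question), and the names vals, ids, seen and keep are likewise illustrative. It only sketches the idea: recode values as integer ids, then scan the rows once while tracking which ids have already been used.

```r
# Hypothetical 6 x 2 input: row 3 reuses 9 and row 5 reuses 1 and 8
m <- matrix(c(7,  9,
              8, 10,
              9,  3,
              2,  1,
              1,  8,
              5,  4), ncol = 2, byrow = TRUE)

# Step 1: replace each value by its position among the unique values
vals <- unique(c(m))
ids  <- array(match(c(m), vals), dim(m))

# Step 2: scan rows in order; keep a row only if none of its ids was used
seen <- logical(length(vals))
keep <- logical(nrow(m))
for (i in seq_len(nrow(m))) {
  if (!any(seen[ids[i, ]])) {
    seen[ids[i, ]] <- TRUE
    keep[i] <- TRUE
  }
}

which(keep)
#> [1] 1 2 4 6
m[keep, , drop = FALSE]
```

Because seen is indexed directly by integer id, each row costs a constant-time lookup per column, and nothing is reallocated inside the loop.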
Benchmarking with a larger dataset:
fReduce <- function(x) { # from SamR
  Reduce(\(x, y)
    if (any(x %in% y)) x else c(x, y),
    asplit(x, 1)
  ) |> matrix(ncol = ncol(x), byrow = TRUE)
}
T_data <- matrix(sample(1e4, 1e4, replace = TRUE), ncol = 2)
bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat     4.3ms   4.68ms    191.       570KB     29.8
#> 2 fReduce     274.2ms 274.88ms      3.64     201MB     18.2
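Using Reduce this way becomes very slow for large matrices because x grows on every iteration: each c(x, y) allocates a new vector and copies the old one, so time and memory grow roughly quadratically with the number of rows kept.
A final performance check on a matrix with 1e5 elements (50,000 rows):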
T_data <- matrix(sample(1e5, 1e5, replace = TRUE), ncol = 2)
bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat    86.5ms   99.8ms    7.41      5.18MB     14.8
#> 2 fReduce         24s      24s    0.0416   19.53GB     10.5
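The cost of growing a vector inside a loop, which is what the accumulator in fReduce does, can be seen in isolation with a small sketch. The functions grow and prealloc below are hypothetical helpers written for this illustration, not part of either answer. They build the same integer vector, one by repeated c() and one by filling a preallocated vector. Timings depend on the machine, so none are shown.

```r
# Growing: every c() call allocates a longer vector and copies the old one
grow <- function(n) {
  out <- integer(0)
  for (i in seq_len(n)) out <- c(out, i)
  out
}

# Preallocating: allocate once, then assign elements in place
prealloc <- function(n) {
  out <- integer(n)
  for (i in seq_len(n)) out[i] <- i
  out
}

n <- 2e4
stopifnot(identical(grow(n), prealloc(n)))  # same result, different cost

system.time(grow(n))
system.time(prealloc(n))
```

uniquemat follows the preallocation pattern: u and k are created at full length before the loop, and the loop only flips entries to TRUE.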