R 悬崖三角洲超大型群数计算导致整数溢出

发布于02月28日

在R中，我需要计算悬崖三角洲.以下是公式:

其中x_i是A组中的观测值，x_j是B组中的观测值，如果x i x_j为真，则[x_ix_j]为1

以下是克里夫(1993)的非数学解释:

...将一组中的n个X中的每一个与m中的每一个进行比较另一个，并计算该成员的第一组较高，多少次较低.

然后除以两两比较的次数，即每组观察次数的乘积.

问题来了:在我的数据中，A组有66208个观测数据，B组有228691个观测数据.66208*228691导致R中的整数溢出.

set.seed(123)
a <- runif(228691)
b <- runif(66208)

x <- length(a) * length(b)

Warning message:
In length(a) * length(b) : NAs produced by integer overflow

因此，我不确定在计算悬崖三角洲时是否可以信任effsize的结果，因为它抛出了类似的警告:

library(effsize)
x <- cliff.delta(a, b)

Warning messages:
1: In n1 * n2 : NAs produced by integer overflow
2: In n1 * n2 : NAs produced by integer overflow

> x$estimate
[1] -0.0005022877

我可能可以找到一种不使用effsize来实现计算的方法，但我看不到有任何方法可以绕过需要除以15141173728(228691*66208)的问题，我不知道怎么做，因为这会导致整数溢出.此外，分子还可能导致整数溢出.

有没有办法允许我处理大量数字，这样我就可以在我的数据中计算悬崖三角洲？

编辑:

我听从了普布尔斯的建议，找到了effsize:cliff.delta个代码中计算d的部分.effsize:cliff.delta返回10的列表，如果唯一需要的是d，则不会出现任何警告:

set.seed(123)
vector1 <- runif(228691)
vector2 <- runif(66208)

#Remove comments to replicate subsample sanity check
#vector1 <- vector1[1:1000]
#vector2 <- vector2[1:1000]

.bsearch.partition <- function(x, a, b = 1, e = length(a)) {
  n <- length(x)
  low <- rep(NA, n)
  L <- rep(b, n)
  H <- rep(e, n)
  
  repeat {
    M <- as.integer((L + H) / 2)
    left <- x <= a[M]
    H[left] <- M[left]
    L[!left] <- M[!left] + 1
    if (all(H <= L)) {
      break
    }
  }
  
  H <- L
  repeat {
    below <- a[H] == x
    below[is.na(below)] <- FALSE
    if (!any(below)) 
      break
    H[below] <- H[below] + 1
  }
  
  repeat {
    L.clean <- L
    L.clean[L.clean < 1] <- NA
    above <- a[L.clean] >= x
    above[is.na(above)] <- FALSE
    if (!any(above)) 
      break
    L[above] <- L[above] - 1
  }
  
  if (any(L == H)){
    H[H == L] <- L[H == L] + 1
  }
  H[H > length(a) + 1] <- length(a) + 1
  cbind(below = L, above = H)
}

treatment <- sort(vector1)
control <- sort(vector2)

n1 <- length(treatment)
n2 <- length(control)

partitions <- .bsearch.partition(treatment, control)
partitions[, 2] <- n2 - partitions[, 2] + 1L
partitions[partitions[, 1] > n2, 1] <- n2

d_i. <- mean(partitions %*% c(1L, -1L) / n2)
d <- mean(d_i.)

> d
[1] -0.0005022877

log.cliff.delta <- function(x, y) { ## bigger <- sum(outer(x, y, `>`)) ## Needs 113 Gb memory bigger <- sum(vapply(y, \(yo) sum(x > yo), numeric(1))) smaller <- sum(vapply(y, \(yo) sum(x < yo), numeric(1))) sign <- c(1, -1)[1+(bigger < smaller)] log_d <- log(abs(bigger - smaller)) - log(length(x)) - log(length(y)) sign*exp(log_d) } ## Subsample sanity check effsize::cliff.delta(a[1:1000], b[1:1000]) #> delta estimate: -0.016404 (negligible) log.cliff.delta(a[1:1000], b[1:1000]) #> -0.016404 ## The whole nine yards effsize::cliff.delta(a, b) #> delta estimate: -0.0005022877 (negligible) #> Warning messages: #> 1: In n1 * n2 : NAs produced by integer overflow #> 2: In n1 * n2 : NAs produced by integer overflow log.cliff.delta(a, b) ## is *not* fast #> -0.0005022877

R 悬崖三角洲超大型群数计算导致整数溢出

编辑:

推荐答案

R相关问答推荐

导入到固定列宽的R中时出现问题

使用gggplot 2在R中重新调整面板和y轴文本大小

ggplot 2中的地块底图(basemaps_gglayer()不起作用)

计算R中的威布尔分布的EDF

获取列中值更改的行号

lightgbm发动机在tidymmodels中的L1正则化""

如何在geom_col中反转条

使用rvest从多个页面抓取时避免404错误

ComplexHEAT：使用COLUMN_SPLIT时忽略COLUMN_ORDER

根据列表中项目的名称合并数据框和列表

在gggraph中显示来自不同数据帧的单个值

我将工作代码重构为一个函数--现在我想不出如何传递轴列参数

远离理论值的伽马密度曲线下面积的近似

判断函数未加载R中的库

自定义交互作用图的标签

在使用ggplot2的情况下，如何在使用coord_trans函数的同时，根据未转换的坐标比来定位geom_瓷砖？

如何在一种 colored颜色中设置数值变量的 colored颜色和高于阈值的 colored颜色点？

在子图内和子图之间对齐行数不均匀的表格罗布对

带查找数据的FCT_REORDER.帧

如何在分组蜂群小区中正确定位标签