This is a very interesting problem / application.
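Your two-column data frame x shows which products are in each cart, but what you are interested in is the event that two products, say products 101 and 102, fall into the same cart. You don't care which cart it is; rather, you want to count how many times such an event happens.
Naturally, your expected output is a contingency table (a square matrix of counts). However, the counts have to be computed first, and that is not a trivial task. The well-commented function below does it.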
Contingency <- function (product_id, cart_id) {
  ## unique product ID
  ProductID <- unique(product_id)
  ## let's use a consecutive numeric ID for product
  ProductIDnum <- match(product_id, ProductID)
  ## split products by cart
  CartItems <- unname(split(ProductIDnum, cart_id))
  ## number of products in each cart
  nItemsPerCart <- lengths(CartItems)
  ## we are only interested in carts with 2+ different products
  CartItems <- CartItems[nItemsPerCart >= 2]
  CartItems <- lapply(CartItems, sort)
  ## an event: a pair of products (i, j) fall into one same cart
  ## (note that we don't care which particular cart it is)
  ## here, `Events` is a 2-column matrix where each row is an event
  ## this matrix will have duplicated rows so that we can `aggregate`
  Events <- t(do.call("cbind", lapply(CartItems, combn, m = 2)))
  ## aggregate: how many times does each event happen?
  Freq <- aggregate(rep(1, nrow(Events)), data.frame(Events), sum)
  ## (i, j, x) triplet for a "TsparseMatrix"
  i <- Freq[[1]]
  j <- Freq[[2]]
  x <- Freq[[3]]
  ## the dimension of the square matrix
  n <- length(ProductID)
  Matrix::sparseMatrix(i = i, j = j, x = x, symmetric = TRUE, dims = c(n, n),
                       dimnames = list(ProductID, ProductID))
}
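To see what the pair-enumeration step inside the function does, here is a minimal sketch on two made-up carts. The `carts` list and its integer IDs are hypothetical and not taken from your data. Sorting each cart first matters because it makes every pair come out as (i, j) with i < j, so the same pair always yields an identical row.

```r
# Two hypothetical carts; products are already coded as sorted integer IDs
carts <- list(c(1, 2, 3), c(1, 3))

# combn(cart, 2) returns a 2-row matrix with one column per unordered pair
pairs <- lapply(carts, combn, m = 2)

# bind all pairs column-wise, then transpose so that each row is one event (i, j)
Events <- t(do.call("cbind", pairs))

# count identical rows, i.e. how many carts contain each pair
Freq <- aggregate(rep(1, nrow(Events)), data.frame(Events), sum)
Freq
```

The pair (1, 3) occurs in both carts, so its count is 2, while (1, 2) and (2, 3) each get a count of 1.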
Now we can apply it to your dataset x.
mat <- Contingency(x$product_id, x$cart_id)
#6 x 6 sparse Matrix of class "dsCMatrix"
# A B C D F G
#A . 1 2 1 . .
#B 1 . 1 . . .
#C 2 1 . . . .
#D 1 . . . . .
#F . . . . . .
#G . . . . . .
## dense form (not recommended if there are lots of products)
as.matrix(mat)
# A B C D F G
#A 0 1 2 1 0 0
#B 1 0 1 0 0 0
#C 2 1 0 0 0 0
#D 1 0 0 0 0 0
#F 0 0 0 0 0 0
#G 0 0 0 0 0 0
Alternatively, you can use xtabs and crossprod:
mat <- Matrix::crossprod(xtabs(~ ., data = x, sparse = TRUE))
#6 x 6 sparse Matrix of class "dsCMatrix"
# A B C D F G
#A 3 1 2 1 . .
#B 1 1 1 . . .
#C 2 1 2 . . .
#D 1 . . 1 . .
#F . . . . 2 .
#G . . . . . 1
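Why this works can be sketched in base R with a dense matrix. The data frame below is hypothetical: I reconstructed it so that it is consistent with the counts printed above, and it is not necessarily your actual x. It also assumes that each product appears at most once per cart. The cross-tabulation is a cart-by-product incidence matrix, and the cross product of that matrix with itself sums, over all carts, the products of the two indicator columns.

```r
# Hypothetical data, reconstructed to match the counts shown above
x <- data.frame(
  cart_id    = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 6),
  product_id = c("A", "B", "C", "A", "C", "A", "D", "F", "F", "G")
)

# cart-by-product incidence matrix: entry [k, p] is 1 if cart k contains product p
inc <- xtabs(~ cart_id + product_id, data = x)

# t(inc) %*% inc: entry [p, q] sums inc[k, p] * inc[k, q] over all carts k,
# which is the number of carts containing both p and q
co <- crossprod(inc)

# the diagonal holds the number of carts containing each product
diag(co)
```

The off-diagonal entries are the co-occurrence counts you want, and the diagonal counts how many carts contain each product on its own, which is why it has to be zeroed afterwards.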
The only thing left is to set the diagonal entries to zero:
diag(mat) <- 0
mat
# A B C D F G
#A 0 1 2 1 . .
#B 1 0 1 . . .
#C 2 1 0 . . .
#D 1 . . 0 . .
#F . . . . 0 .
#G . . . . . 0
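Note, however, that "diag<-" does not do a great job here: the replacement zeros are not treated as zeros in the storage sense. They are kept as explicitly stored entries, which is why the diagonal prints as 0 rather than as a dot.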
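If the stored zeros bother you, one possible clean-up is drop0() from the Matrix package, which removes explicitly stored zeros from a sparse matrix. This is a suggestion on my part, shown on a small made-up matrix (`m` is hypothetical, not your data).

```r
library(Matrix)

# hypothetical 3 x 3 symmetric count matrix with a non-zero diagonal;
# only the upper triangle is supplied because symmetric = TRUE
m <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 2, 2, 3),
                  x = c(3, 1, 1, 2), symmetric = TRUE)

# zero the diagonal, as in the answer above
diag(m) <- 0

# remove any explicitly stored zeros from the sparse structure
cleaned <- drop0(m)
cleaned
```

The values are unchanged; only the storage differs, so the diagonal should now print as dots like every other empty entry.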
Damn! I just found a duplicate: Creating co-occurrence matrix.