R: Build a sparse matrix of item co-occurrence frequencies (for analyzing product cross-selling)

Posted on July 13

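I'm stuck trying to create a sparse matrix in which I can count how often products are sold together, based on cart ID and product ID.

Example data frame: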

x = data.frame(
      cart_id = c("1","1","1","2","2","3","4","5","5","6"),
      product_id = c("A","B","C","D","A","F","G","A","C","F")
)

Ideal output: a sparse matrix containing the number of times each pair of products appears in the same cart.

Any hints?

EDIT:

Both answers solve the problem.

Recommended answer

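This is a very interesting question / application!!!

Your two-column data frame x records which products are in which cart, but what you are interested in is an event such as "products 101 and 102 fall into the same cart". You don't care which cart it is; you want to count how many times such an event occurs.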
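To make the notion of an "event" concrete, here is a minimal sketch for a single cart. It assumes cart 1 from the example data (products A, B and C); the variable names `cart1` and `pairs` are made up for illustration.

```r
# products in cart 1 of the example data
cart1 <- c("A", "B", "C")

# every unordered pair of products in this cart is one "event";
# combn() returns the pairs as columns, so transpose to get one pair per row
pairs <- t(combn(sort(cart1), 2))
pairs
```

Each row of `pairs` is one event. Repeating this for every cart and counting identical rows gives the co-occurrence frequencies.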

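Your expected output is, of course, a contingency table (a square matrix of counts). However, the counts have to be computed first, which is not a trivial task. The well-commented function below does this.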

Contingency <- function (product_id, cart_id) {
  ## unique product ID
  ProductID <- unique(product_id)
  ## let's use a consecutive numeric ID for product
  ProductIDnum <- match(product_id, ProductID)
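  ## split products by cart, dropping repeated products within a cart
  ## (otherwise a repeated product would pair with itself in `combn`)
  CartItems <- unname(lapply(split(ProductIDnum, cart_id), unique))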
  ## number of products in each cart
  nItemsPerCart <- lengths(CartItems)
  ## we are only interested in carts with 2+ different products
  CartItems <- CartItems[nItemsPerCart >= 2]
  CartItems <- lapply(CartItems, sort)
  ## an event: a pair of products (i, j) fall into one same cart
  ## (note that we don't care which particular cart it is)
  ## here, `Events` is a 2-column matrix where each row is an event
  ## this matrix will have duplicated rows so that we can `aggregate`
  Events <- t(do.call("cbind", lapply(CartItems, combn, m = 2)))
  ## aggregate: how many times does each event happen?
  Freq <- aggregate(rep(1, nrow(Events)), data.frame(Events), sum)
  ## (i, j, x) triplet for a "TsparseMatrix"
  i <- Freq[[1]]
  j <- Freq[[2]]
  x <- Freq[[3]]
  ## the dimension of the square matrix
  n <- length(ProductID)
  Matrix::sparseMatrix(i = i, j = j, x = x, symmetric = TRUE, dims = c(n, n),
                       dimnames = list(ProductID, ProductID))
}

Now we can apply it to the dataset x.

mat <- Contingency(x$product_id, x$cart_id)
#6 x 6 sparse Matrix of class "dsCMatrix"
#  A B C D F G
#A . 1 2 1 . .
#B 1 . 1 . . .
#C 2 1 . . . .
#D 1 . . . . .
#F . . . . . .
#G . . . . . .

## dense form (not recommended if there are lots of products)
as.matrix(mat)
#  A B C D F G
#A 0 1 2 1 0 0
#B 1 0 1 0 0 0
#C 2 1 0 0 0 0
#D 1 0 0 0 0 0
#F 0 0 0 0 0 0
#G 0 0 0 0 0 0
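As a sanity check on the counts above, a brute-force tally in base R should find the same pairs. This is only a sketch, not part of the answer's function: the names `carts`, `pairs` and `counts` and the "A-B" label format are assumptions made for illustration.

```r
x <- data.frame(
  cart_id    = c("1","1","1","2","2","3","4","5","5","6"),
  product_id = c("A","B","C","D","A","F","G","A","C","F")
)

# one character vector of products per cart
carts <- split(as.character(x$product_id), x$cart_id)

# enumerate all unordered product pairs within each cart
pairs <- do.call(rbind, lapply(carts, function(p) {
  p <- sort(unique(p))
  if (length(p) < 2) return(NULL)   # single-product carts yield no pairs
  t(combn(p, 2))
}))

# tally identical pairs across carts
counts <- table(paste(pairs[, 1], pairs[, 2], sep = "-"))
counts
```

The tally reports A-C twice and A-B, A-D and B-C once each, which agrees with the off-diagonal entries of the matrix. This approach is fine for small data but does not give you a sparse matrix directly.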

You can also use xtabs and crossprod:

mat <- Matrix::crossprod(xtabs(~ ., data = x, sparse = TRUE))
#6 x 6 sparse Matrix of class "dsCMatrix"
#  A B C D F G
#A 3 1 2 1 . .
#B 1 1 1 . . .
#C 2 1 2 . . .
#D 1 . . 1 . .
#F . . . . 2 .
#G . . . . . 1
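Why this works can be seen on a small dense version of the same computation. The sketch below uses base R only and assumes each product appears at most once per cart, as in the example data; `M` and `co` are illustrative names.

```r
x <- data.frame(
  cart_id    = c("1","1","1","2","2","3","4","5","5","6"),
  product_id = c("A","B","C","D","A","F","G","A","C","F")
)

# cart-by-product incidence matrix: M[c, p] is 1 if cart c contains product p
M <- unclass(table(x$cart_id, x$product_id))

# crossprod(M) is t(M) %*% M: entry (p, q) sums M[c, p] * M[c, q] over carts,
# i.e. it counts the carts that contain both p and q
co <- crossprod(M)

co["A", "C"]   # carts containing both A and C
co["A", "A"]   # carts containing A at all (the diagonal)
```

The diagonal therefore holds the number of carts each product appears in, not a co-occurrence count.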

The only thing left to do is set the diagonal entries to zero:

diag(mat) <- 0
mat
#  A B C D F G
#A 0 1 2 1 . .
#B 1 0 1 . . .
#C 2 1 0 . . .
#D 1 . . 0 . .
#F . . . . 0 .
#G . . . . . 0

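Note, however, that "diag<-" does not do a very good job here: the replacement zeros are stored explicitly, so in the sparse-storage sense they are not treated as zeros (they print as 0 rather than .).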
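One possible way to clean this up, which is not part of the original answer, is `Matrix::drop0()`, which removes explicitly stored zeros from a sparse matrix. The sketch below rebuilds `mat` from the example data; `stringsAsFactors = TRUE` is added here as a precaution so that `xtabs` receives factors.

```r
library(Matrix)

x <- data.frame(
  cart_id    = c("1","1","1","2","2","3","4","5","5","6"),
  product_id = c("A","B","C","D","A","F","G","A","C","F"),
  stringsAsFactors = TRUE
)

mat <- crossprod(xtabs(~ ., data = x, sparse = TRUE))
diag(mat) <- 0

# number of stored entries before cleaning (may include explicit zeros)
length(mat@x)

# drop the explicitly stored zeros; the values of the matrix are unchanged
mat <- drop0(mat)
length(mat@x)
mat
```

After `drop0()`, only genuinely non-zero counts remain in storage.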

Damn!!! I just found a duplicate: Creating co-occurrence matrix.
