We can use
t(+sapply(gene_sets, "%in%", x = c("Gene1", "Gene2", "Gene3")))
If you want to obtain c("Gene1", "Gene2", "Gene3") dynamically, we can do:
GeneID <- sort(unique(unlist(gene_sets)))
mat <- t(+sapply(gene_sets, "%in%", x = GeneID)) ## matrix output
colnames(mat) <- GeneID
#         Gene1 Gene2 Gene3
#pathwayX     0     0     1
#pathwayY     0     1     1
#pathwayz     1     1     1
data.frame(mat) ## data.frame output
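To see why the one-liner works, here is a minimal self-contained sketch. The toy `gene_sets` list is an assumption, reconstructed from the output printed above; any named list of character vectors should behave the same way.

```r
# Hypothetical toy input, reconstructed from the output shown above
gene_sets <- list(
  pathwayX = "Gene3",
  pathwayY = c("Gene2", "Gene3"),
  pathwayz = c("Gene1", "Gene2", "Gene3")
)

GeneID <- sort(unique(unlist(gene_sets)))

# Because x = GeneID is passed by name, each gene set fills the remaining
# argument of `%in%` (table), so every call is GeneID %in% <one gene set>.
# sapply() binds the logical vectors as columns, unary + turns TRUE/FALSE
# into 1/0, and t() puts the pathways in rows.
mat <- t(+sapply(gene_sets, "%in%", x = GeneID))
colnames(mat) <- GeneID
mat
```

Passing `x` by name is what lets the same gene vector be tested against each pathway in turn, so no explicit loop is needed.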
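My impression is that gene problems are usually large and sparse. If in reality you have hundreds of thousands of genes and pathways, then the sparse matrix solution below is the best choice.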
pathwayID <- names(gene_sets)
n1 <- lengths(gene_sets, use.names = FALSE) ## number of genes in each pathway
genesVec <- unlist(gene_sets, use.names = FALSE)
GeneID <- sort(unique(genesVec))
i <- rep(seq_along(n1), n1) ## row index: pathway number, repeated once per gene
j <- match(genesVec, GeneID)
Matrix::sparseMatrix(i = i, j = j, x = rep.int(1, length(i)),
                     dimnames = list(pathwayID, GeneID))
#3 x 3 sparse Matrix of class "dgCMatrix"
#         Gene1 Gene2 Gene3
#pathwayX     .     .     1
#pathwayY     .     1     1
#pathwayz     1     1     1
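As a sanity check, the sparse result can be compared against the dense one on small input. The sketch below again assumes the hypothetical toy `gene_sets`. It also passes `dims` explicitly, which is my addition rather than part of the code above: without it, the dimensions are inferred from the largest index, so a trailing empty pathway could make them disagree with `dimnames`.

```r
library(Matrix)  # recommended package, normally installed alongside R

# Hypothetical toy input, reconstructed from the output shown above
gene_sets <- list(
  pathwayX = "Gene3",
  pathwayY = c("Gene2", "Gene3"),
  pathwayz = c("Gene1", "Gene2", "Gene3")
)

pathwayID <- names(gene_sets)
n1 <- lengths(gene_sets, use.names = FALSE)
genesVec <- unlist(gene_sets, use.names = FALSE)
GeneID <- sort(unique(genesVec))

i <- rep(seq_along(n1), n1)   # row index: one pathway number per listed gene
j <- match(genesVec, GeneID)  # column index: position of each gene in GeneID

sp <- sparseMatrix(i = i, j = j, x = rep.int(1, length(i)),
                   dims = c(length(pathwayID), length(GeneID)),
                   dimnames = list(pathwayID, GeneID))

# Dense version for comparison; only sensible on small inputs
dense <- t(+sapply(gene_sets, "%in%", x = GeneID))
colnames(dense) <- GeneID
stopifnot(all(as.matrix(sp) == dense))
```

One caveat: `sparseMatrix()` adds up the `x` values of duplicated (i, j) pairs, so a gene listed twice in the same pathway would give 2 instead of 1. Applying `unique()` to each gene set first avoids that.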