I have a list (its elements have unequal lengths):

my.list = list(col1 = c("CC", "CT", "TT"), 
     col2 = c("GG", "GT"), 
     col3 = c("CC", "CT"),
     col4 = c("CC", "CG", "GG"), 
     col5 = c("AC", "CC"),
     col6 = "GG")

$col1
[1] "CC" "CT" "TT"

$col2
[1] "GG" "GT"

$col3
[1] "CC" "CT"

$col4
[1] "CC" "CG" "GG"

$col5
[1] "AC" "CC"

$col6
[1] "GG"

It can be converted to a data frame:

mylist.df = plyr::ldply(my.list, rbind)
names(mylist.df) <- c("cols","g1", "g2", "g3")

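I would like to subset the data frame below using my.list or mylist.df. Basically, I want to keep every row in which each column's value is one of the elements listed for that column: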

df.to.subset = structure(list(IDs = c("ID1", "ID2", "ID3", "ID4", "ID5", "ID6"), 
               gr = c("gr1", "gr1", "gr1", "gr1", "gr1", "gr1"), 
               var = c(-3.451, -3.469, -3.837, -3.344, -3.904, -3.943), 
               col1 = structure(c(1L, 2L, 3L, 1L, 2L, 2L), levels = c("CC", "CT", "TT"), class = "factor"), 
               col2 = structure(c(1L, 1L, 2L, 3L, 3L, 3L), levels = c("GG", "GT", "TT"), class = "factor"), 
               col3 = structure(c(1L, 2L, 1L, 1L, 1L, 1L), levels = c("CC", "CT"), class = "factor"), 
               col4 = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("CC", "CG", "GG"), class = "factor"), 
               col5 = structure(c(1L, 2L, 2L, 2L, 2L, 2L), levels = c("AC", "CC"), class = "factor"), 
               col6 = structure(c(1L, 1L, 2L, 1L, 1L, 1L), levels = c("GG","AA"), class = "factor")), 
          row.names = c(NA, 
                        -6L), class = c("tbl_df", "tbl", "data.frame"))

  IDs   gr      var col1  col2  col3  col4  col5  col6 
  <chr> <chr> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 ID1   gr1   -3.45 CC    GG    CC    CG    AC    GG   
2 ID2   gr1   -3.47 CT    GG    CT    CG    CC    GG   
3 ID3   gr1   -3.84 TT    GT    CC    CG    CC    AA   
4 ID4   gr1   -3.34 CC    TT    CC    CG    CC    GG   
5 ID5   gr1   -3.90 CT    TT    CC    CG    CC    GG   
6 ID6   gr1   -3.94 CT    TT    CC    CG    CC    GG   

(The final result would be:)

  IDs   gr      var col1  col2  col3  col4  col5  col6 
  ID1   gr1   -3.45 CC    GG    CC    CG    AC    GG   
  ID2   gr1   -3.47 CT    GG    CT    CG    CC    GG   

In addition, I would like to relevel each factor column of df.to.subset so that its levels match those in this data frame:

factor.levels.cols = structure(list(cols = c("col1", "col2", "col3", "col4", "col5", "col6"), 
               g1 = c("CC", "GG", "CC", "CC", "AA", "AA"), 
               g2 = c("CT", "GT", "CT", "CG", "AC", "AG"), 
               g3 = c("TT", "TT", "TT", "GG", "CC", "GG")), 
          row.names = c(NA, 6L), class = "data.frame")

  cols g1 g2 g3
1 col1 CC CT TT
2 col2 GG GT TT
3 col3 CC CT TT
4 col4 CC CG GG
5 col5 AA AC CC
6 col6 AA AG GG

Is a for loop mandatory here, or is there a way to make this faster? I have >1,000,000 entries to modify.

Recommended answer

Below is just a base R option to do this.

I think you can subset df.to.subset as follows

out <- df.to.subset[rowMeans(mapply(`%in%`, df.to.subset[names(my.list)], my.list)) == 1, ]

which gives us

# A tibble: 2 × 9
  IDs   gr      var col1  col2  col3  col4  col5  col6
  <chr> <chr> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 ID1   gr1   -3.45 CC    GG    CC    CG    AC    GG
2 ID2   gr1   -3.47 CT    GG    CT    CG    CC    GG
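To see why this filter works, it can help to look at the intermediate logical matrix on a toy example. The sketch below is only illustrative: `allowed`, `toy`, `hits`, `keep` and `keep2` are made-up names standing in for my.list and df.to.subset, not objects from the question. The last line shows one possible variant that combines the per-column tests with `Reduce` instead of building a matrix; whether it is actually faster on >1,000,000 rows is an assumption that would need benchmarking.

```r
# Toy stand-ins for my.list and df.to.subset (illustrative names and values)
allowed <- list(a = c("x", "y"), b = "p")
toy <- data.frame(
    id = 1:3,
    a = factor(c("x", "z", "y")),
    b = factor(c("p", "p", "q"))
)

# One logical column per entry of `allowed`: TRUE where the value is permitted
hits <- mapply(`%in%`, toy[names(allowed)], allowed)

# A row passes only if every column matched, i.e. its row mean is exactly 1
keep <- rowMeans(hits) == 1
toy[keep, ]

# Variant without the intermediate matrix: AND the per-column tests together
keep2 <- Reduce(`&`, Map(`%in%`, toy[names(allowed)], allowed))
```

Both `keep` and `keep2` are plain logical vectors with one entry per row, so either can be used directly as a row index.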

If you want to reset the levels of the columns, you can try

lvls <- with(
    reshape(
        factor.levels.cols,
        direction = "long",
        idvar = "cols",
        varying = -1,
        v.names = "g"
    ),
    split(g, cols)
)
# factor(x, levels = lv) keeps each value and only changes the level set and order;
# `levels<-` would instead rename the existing levels by position
out[names(lvls)] <- Map(function(x, lv) factor(x, levels = lv), out[names(lvls)], lvls)

You will see that the levels are reset to the desired ones

> str(out)
tibble [2 × 9] (S3: tbl_df/tbl/data.frame)
 $ IDs : chr [1:2] "ID1" "ID2"
 $ gr  : chr [1:2] "gr1" "gr1"
 $ var : num [1:2] -3.45 -3.47
 $ col1: Factor w/ 3 levels "CC","CT","TT": 1 2
 $ col2: Factor w/ 3 levels "GG","GT","TT": 1 1
 $ col3: Factor w/ 3 levels "CC","CT","TT": 1 2
 $ col4: Factor w/ 3 levels "CC","CG","GG": 2 2
 $ col5: Factor w/ 3 levels "AA","AC","CC": 2 3
 $ col6: Factor w/ 3 levels "AA","AG","GG": 3 3
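
When resetting levels, it is worth keeping in mind that `factor(x, levels = ...)` and `levels(x) <- ...` do different things: the first keeps every observation's label and re-maps its integer code, while the second renames the existing levels by position. A minimal sketch on a toy factor shaped like col5 (`x`, `target`, `releveled` and `relabeled` are illustrative names, not from the question):

```r
# Toy factor mirroring col5 in the question: current levels are "AC", "CC"
x <- factor(c("AC", "CC", "CC"), levels = c("AC", "CC"))
target <- c("AA", "AC", "CC")   # desired level set and order

# factor(x, levels = ...) keeps each observation's label and re-maps its code
releveled <- factor(x, levels = target)
as.character(releveled)   # "AC" "CC" "CC"
as.integer(releveled)     # 2 3 3

# `levels<-` renames levels by position, so the first level "AC" becomes "AA"
relabeled <- x
levels(relabeled) <- target
as.character(relabeled)   # "AA" "AC" "AC"
```

If the goal is to keep the genotype values and only change the level order, the `factor(x, levels = ...)` form is probably the one you want.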
