I am working with the R programming language. I have the following dataset: students take an exam multiple times, and each attempt is either a pass ("1") or a fail ("0"). The data looks something like this:

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
63018  1       0      2001-08-15           1
72324  1       1      2002-09-03           2
98866  1       0      2003-01-13           3
56137  1       1      2005-06-15           4
77746  1       0      2007-06-26           5
21438  1       0      2011-09-23           6

Then, I converted the data into the following wide format:

library(tidyr)

my_data = my_data %>% 
  pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")

# A tibble: 10,000 x 24
      id   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`  `19`  `20`  `21`  `22`  `23`
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     0     1     0     1     0     0     0     1     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2     2     1     0     1     1     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3     3     1     0     1     0     1     1     1     1     0     1     1     1     0     0     0     1     1     1    NA    NA    NA    NA    NA
 4     4     1     1     0     0     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5     5     1     0     1     0     0     1     0     0     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6     6     1     1     0     1     1     0     0     1     0     0     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7     7     0     0     1     1     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8     8     0     1     0     1     0     1     0     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9     9     0     0     0     0     0     0     1     1     0     1     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
10    10     0     0     1     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# ... with 9,990 more rows

Now, suppose I have the following sequences:

my_grid= expand.grid(0:1, 0:1, 0:1)
n = nrow(my_grid)
n = c(1:n)

my_grid$sequence = paste("sequence", n)
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)

     Var1 Var2 Var3 sequence seq
1    0    0    0 sequence 1  000
2    1    0    0 sequence 2  100
3    0    1    0 sequence 3  010
4    1    1    0 sequence 4  110
5    0    0    1 sequence 5  001
6    1    0    1 sequence 6  101
7    0    1    1 sequence 7  011
8    1    1    1 sequence 8  111

GOAL: Within the entire dataset, I want to find out the number of times each sequence appears (at the row level). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4&5, failed test 1&2) - what is the probability that such a student will also fail the next test?

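I tried to solve this problem as follows: I concatenated each student's exam results into a single string and stored it in a new column. This should make it easier to identify the required patterns: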

my_list = list()
for (i in seq_len(nrow(my_data)))
{
  val_i = paste(my_data[i, -1], collapse = "")
  print(val_i)
  my_list[[i]] = val_i
}

my_data$cols <- my_list

my_fun <- function(seq, data){
  return(lengths(gregexpr(seq, data)))
}
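As an aside, the per-student strings can also be built without a loop. The following is only a sketch, and it assumes the long-format data (one row per exam, already sorted by id and date, i.e. before pivot_wider is applied). The names toy_long and toy_cols are hypothetical and the values are made up for illustration.

```r
# Toy long-format data, ordered by student and exam date (made-up values)
toy_long <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2),
  results = c(1, 0, 0, 1, 0, 0, 0)
)

# tapply() splits results by id and pastes each student's results together
toy_cols <- tapply(toy_long$results, toy_long$id, paste, collapse = "")
toy_cols
```

Because each string is built only from the exams a student actually took, no "NA" padding from the wide format ends up in the strings.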

PROBLEM: Then, I tried to apply this function to obtain the final counts - but I am getting this error:

#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
Error in input[i, ] : incorrect number of dimensions 

Ideally, I would like the final result to look like this (from here, I can simply calculate the conditional probabilities):

# FINAL RESULT
  Var1 Var2 Var3   sequence seq counts
1    0    0    0 sequence 1 000    ...
2    1    0    0 sequence 2 100    ...
3    0    1    0 sequence 3 010    ...
4    1    1    0 sequence 4 110    ...
5    0    0    1 sequence 5 001    ...
6    1    0    1 sequence 6 101    ...
7    0    1    1 sequence 7 011    ...
8    1    1    1 sequence 8 111    ...

QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?

Thanks!

  • Note 1: Instead of using a function, I also tried to do this with a for loop.

Here is the code I wrote:

my_list = list()
for (i in 1:length(my_grid$seq))
{
    seq_i = my_grid$seq[i]
    val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
    print(c(i, seq_i, val_i))
}

[1] "1"     "000"   "11255"
[1] "2"     "100"   "12743"
[1] "3"     "010"   "12145"
[1] "4"     "110"   "12676"
[1] "5"     "001"   "12765"
[1] "6"     "101"   "12085"
[1] "7"     "011"   "12672"
[1] "8"     "111"   "11201"

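But for some reason, I don't think this is correct (i.e. the counts look quite high)?

  • Note 2: I am also trying to make sure that the conditional probabilities are calculated using each individual student's results, and not by "pooling" all students' results together.

For example: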

student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0

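Combining these two students' results into a single string "1,1,0,0,1,0,0, 1,0,0,1,1,1,0" and then calculating the frequency counts would be incorrect. I want to calculate the counts at the student level and then add them together.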
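To make Note 2 concrete, here is a minimal sketch using the two example students above. The helper count_matches is a hypothetical name introduced only for this illustration. It counts matches within each string separately and discards the -1 that gregexpr returns when nothing matches.

```r
# The two example students, one string per student
students <- c("1100100", "1001110")

# Hypothetical helper: total number of matches across a vector of strings,
# ignoring the -1 that gregexpr() returns for "no match"
count_matches <- function(pattern, strings) {
  hits <- gregexpr(pattern, strings)
  sum(vapply(hits, function(x) sum(x > 0), integer(1)))
}

# Student-level counting: each string is searched on its own
per_student <- count_matches("001", students)

# Pooled counting: the strings are glued together first
pooled <- count_matches("001", paste(students, collapse = ""))

c(per_student = per_student, pooled = pooled)
```

In this toy case the pooled count is larger, because one extra "001" spans the boundary between the end of student 1 and the start of student 2.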

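Recommended answer

The issue is likely that gregexpr returns -1 when there is no match. When we call lengths on that result, the -1 is counted as one match, which inflates the total once the values are summed. If we change the function to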

 my_fun <- function(seq, data){
    sum(lengths(lapply(gregexpr(seq, data), function(x) x[x != -1])))   
  }

and then apply this function as

library(dplyr)
my_grid %>% 
  rowwise %>% 
  mutate(counts = my_fun(seq, my_data$cols)) %>%
  ungroup
# A tibble: 8 × 6
   Var1  Var2  Var3 sequence   seq   counts
  <int> <int> <int> <chr>      <chr>  <int>
1     0     0     0 sequence 1 000     6215
2     1     0     0 sequence 2 100    10018
3     0     1     0 sequence 3 010     8325
4     1     1     0 sequence 4 110     9939
5     0     0     1 sequence 5 001    10072
6     1     0     1 sequence 6 101     8274
7     0     1     1 sequence 7 011     9973
8     1     1     1 sequence 8 111     6097  
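For readers who prefer base R, the same idea can be sketched with vapply. Note that mapply walks its vector arguments in parallel, so it pairs the i-th pattern with the i-th student instead of scanning all students for each pattern. Looping over the patterns only avoids that. The objects toy_cols and toy_grid below are made-up stand-ins for my_data$cols and my_grid.

```r
# Same function as in the answer: drop the -1 "no match" marker before counting
my_fun <- function(seq, data) {
  sum(lengths(lapply(gregexpr(seq, data), function(x) x[x != -1])))
}

# Made-up stand-ins for my_data$cols and my_grid
toy_cols <- c("0100", "111", "0001000")
toy_grid <- data.frame(seq = c("000", "100", "111"), stringsAsFactors = FALSE)

# For each pattern, search every student's string and sum the matches
toy_grid$counts <- vapply(toy_grid$seq, my_fun, integer(1),
                          data = toy_cols, USE.NAMES = FALSE)
toy_grid
```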

Even when we test only the first 6 elements, there are 2 cases with no match, and each returns -1:

> gregexpr(my_grid$seq[1], head(my_data$cols))
[[1]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 5
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[5]]
[1] 4
attr(,"match.length")
[1] 3
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[6]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
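One further point that may matter for the conditional probabilities: gregexpr reports non-overlapping matches, so a student with results "0000" contributes one match for "000" even though there are two windows of three consecutive fails. If every window should count, a zero-width lookahead with perl = TRUE is one possible approach. The sketch below is an assumption about what is wanted, not part of the original answer, and count_windows and toy_cols are hypothetical names with made-up data.

```r
# Made-up per-student result strings
toy_cols <- c("0000", "0010", "1001")

# Hypothetical helper: count every window matching the pattern, including
# overlapping ones, by matching a zero-width lookahead at each position
count_windows <- function(pattern, strings) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), strings, perl = TRUE)
  sum(vapply(hits, function(x) sum(x > 0), integer(1)))
}

n_000 <- count_windows("000", toy_cols)  # two fails followed by a fail
n_001 <- count_windows("001", toy_cols)  # two fails followed by a pass

# Estimated P(fail next exam | failed the previous two exams)
p_fail_after_two_fails <- n_000 / (n_000 + n_001)
p_fail_after_two_fails
```

The probability conditions on the first two characters of the window, so the denominator adds up the windows that start with "00", whichever result follows.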
