I am working with the R programming language. I have the following dataset: students take an exam multiple times and either pass ("1") or fail ("0"). The data looks something like this:
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
id results date_exam_taken exam_number
63018 1 0 2001-08-15 1
72324 1 1 2002-09-03 2
98866 1 0 2003-01-13 3
56137 1 1 2005-06-15 4
77746 1 0 2007-06-26 5
21438 1 0 2011-09-23 6
Then, I converted the data into the following format:
library(tidyr)
my_data = my_data %>%
  pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")
# A tibble: 10,000 x 24
id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16` `17` `18` `19` `20` `21` `22` `23`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0 0 0 1 0 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 2 1 0 1 1 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 3 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 NA NA NA NA NA
4 4 1 1 0 0 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 5 1 0 1 0 0 1 0 0 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
6 6 1 1 0 1 1 0 0 1 0 0 1 NA NA NA NA NA NA NA NA NA NA NA NA
7 7 0 0 1 1 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 8 0 1 0 1 0 1 0 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
9 9 0 0 0 0 0 0 1 1 0 1 0 NA NA NA NA NA NA NA NA NA NA NA NA
10 10 0 0 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ... with 9,990 more rows
Now, suppose I have the following sequences:
my_grid= expand.grid(0:1, 0:1, 0:1)
n = nrow(my_grid)
n = c(1:n)
my_grid$sequence = paste("sequence", n)
my_grid$seq = paste0(my_grid$Var1, my_grid$Var2, my_grid$Var3)
Var1 Var2 Var3 sequence seq
1 0 0 0 sequence 1 000
2 1 0 0 sequence 2 100
3 0 1 0 sequence 3 010
4 1 1 0 sequence 4 110
5 0 0 1 sequence 5 001
6 1 0 1 sequence 6 101
7 0 1 1 sequence 7 011
8 1 1 1 sequence 8 111
GOAL: Within the entire dataset, I want to find out the number of times each sequence appears (at the row level). For example, given that a student in this population failed two consecutive tests (e.g. failed tests 4 & 5, or failed tests 1 & 2), what is the probability that such a student will also fail the next test?
I tried to solve this problem as follows: I concatenated each student's exam results into a single string and stored it in a new column. This should make it easier to identify the desired patterns:
my_list = list()
for (i in 1:length(1:nrow(my_data)))
{
val_i = paste(my_data[i,-1],collapse="")
print(val_i)
my_list[[i]] = val_i
}
my_data$cols <- my_list
my_fun <- function(seq, data){
return(lengths(gregexpr(seq, data)))
}
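As an aside, a minimal sketch of building the per-student strings without an explicit loop, assuming that missing exams (NA) should be dropped rather than pasted in as the text "NA". The toy data frame `wide` and the helper `collapse_results` are hypothetical names used only for illustration:

```r
# Toy wide-format data (hypothetical), same layout as the pivoted my_data
wide <- data.frame(id = 1:3,
                   `1` = c(1, 0, 1),
                   `2` = c(0, 0, NA),
                   `3` = c(1, NA, NA),
                   check.names = FALSE)

# Collapse one student's results into a string, skipping missing exams
collapse_results <- function(row) paste(row[!is.na(row)], collapse = "")

# apply() over rows returns a plain character vector, not a list
wide$cols <- apply(wide[, -1], 1, collapse_results)
wide$cols
```

Storing a character vector instead of a list column also keeps later string functions such as `gregexpr` or `substring` working on ordinary text.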
PROBLEM: Then, I tried to apply this function to obtain the final counts, but I am getting this error:
#PROBLEM
my_grid$counts = mapply(my_fun, c(my_grid$seq), my_data$cols)
Error in input[i, ] : incorrect number of dimensions
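One thing worth noting about the call above: `mapply` pairs its arguments element by element (sequence 1 with string 1, sequence 2 with string 2, and so on) rather than checking every sequence against every student. A sketch of an alternative that loops over the sequences and sums within-student counts, assuming the result strings are held in a plain character vector; `cols`, `patterns` and `count_pattern` are hypothetical names:

```r
# Hypothetical per-student result strings (missing exams already dropped)
cols <- c("0001", "110", "0000")

# Count occurrences of `pattern` in each string, allowing overlaps,
# by comparing every window of the same width against the pattern
count_pattern <- function(pattern, strings) {
  k <- nchar(pattern)
  vapply(strings, function(s) {
    n <- nchar(s)
    if (n < k) return(0L)
    starts <- seq_len(n - k + 1)
    sum(substring(s, starts, starts + k - 1) == pattern)
  }, integer(1), USE.NAMES = FALSE)
}

patterns <- c("000", "001", "110")
# One total per pattern: count within each student, then add across students
counts <- sapply(patterns, function(p) sum(count_pattern(p, cols)))
counts
```

Because each string is scanned on its own, no match can span two students.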
Ideally, I would like the final result to look like this (from here, I can simply compute the conditional probabilities):
# FINAL RESULT
Var1 Var2 Var3 sequence seq counts
1 0 0 0 sequence 1 000 ...
2 1 0 0 sequence 2 100 ...
3 0 1 0 sequence 3 010 ...
4 1 1 0 sequence 4 110 ...
5 0 0 1 sequence 5 001 ...
6 1 0 1 sequence 6 101 ...
7 0 1 1 sequence 7 011 ...
8 1 1 1 sequence 8 111 ...
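The conditional probability from the GOAL could then be read off such a table. A sketch, assuming that `seq` is read left to right in exam order (so "000" is three consecutive fails and "001" is two fails followed by a pass), and using made-up counts purely for illustration:

```r
# Hypothetical counts; the real values would come from the counting step
counts <- c("000" = 30, "001" = 20)

# P(fail next | failed the previous two) =
#   n(fail, fail, fail) / (n(fail, fail, fail) + n(fail, fail, pass))
p_fail_after_two_fails <- counts[["000"]] / (counts[["000"]] + counts[["001"]])
p_fail_after_two_fails
```

The denominator only includes windows where a third exam was actually taken, so students who stopped after two fails do not enter the estimate.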
QUESTION: Can someone please show me what I am doing wrong and what I can do to fix this?
Thanks!
- Note 1: Instead of using a function, I also tried to do this with a for loop.
Here is the code I wrote:
my_list = list()
for (i in 1:length(my_grid$seq))
{
seq_i = my_grid$seq[i]
val_i = sum(lengths(gregexpr(seq_i, my_data$cols)))
print(c(i, seq_i, val_i))
}
[1] "1" "000" "11255"
[1] "2" "100" "12743"
[1] "3" "010" "12145"
[1] "4" "110" "12676"
[1] "5" "001" "12765"
[1] "6" "101" "12085"
[1] "7" "011" "12672"
[1] "8" "111" "11201"
But for some reason, I don't think this is correct (i.e. the counts look quite high)?
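A possible reason the counts look high, offered as a sketch rather than a definitive diagnosis: when a pattern is not found, `gregexpr` returns `-1` for that string, which still has length 1, so `lengths()` reports 1 instead of 0. With 10,000 students, that alone keeps every total at 10,000 or more. In addition, `gregexpr` does not report overlapping matches. The helper name `count_matches` below is hypothetical:

```r
x <- c("1111", "0000", "000")

# lengths(gregexpr(...)) reports 1 for "1111" although "000" never occurs,
# and only 1 for "0000" although "000" occurs twice when overlaps count
naive <- lengths(gregexpr("000", x))

# Count only real matches (position > 0); the lookahead "(?=000)" with
# perl = TRUE lets overlapping occurrences be found as well
count_matches <- function(pattern, strings) {
  m <- gregexpr(paste0("(?=", pattern, ")"), strings, perl = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}
fixed <- count_matches("000", x)

rbind(naive, fixed)
```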
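- Note 2: I am also trying to make sure that the conditional probabilities are calculated from each individual student's results, not by "pooling" all students' results together.
For example:
student 1 = 1,1,0,0,1,0,0
student 2 = 1,0,0,1,1,1,0
Combining these two students' results into a single string "1,1,0,0,1,0,0, 1,0,0,1,1,1,0" and then calculating the frequency counts would be incorrect. I want to calculate the counts at the student level and then add them together.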
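To illustrate the difference with the two students above, a sketch that counts "001" within each student separately and compares it with the pooled string. The helper `count_windows` is a hypothetical name:

```r
student_1 <- "1100100"
student_2 <- "1001110"

# Count overlapping occurrences of `pattern` in a single string
count_windows <- function(pattern, s) {
  k <- nchar(pattern)
  if (nchar(s) < k) return(0L)
  starts <- seq_len(nchar(s) - k + 1)
  sum(substring(s, starts, starts + k - 1) == pattern)
}

# Per-student counts added together: 1 + 1
per_student <- count_windows("001", student_1) + count_windows("001", student_2)

# The pooled string also picks up a "001" that straddles the two students
pooled <- count_windows("001", paste0(student_1, student_2))

c(per_student = per_student, pooled = pooled)  # 2 versus 3
```

The extra match in the pooled version comes from the last two results of student 1 followed by the first result of student 2, which is exactly the kind of cross-student window that should be excluded.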