I'm trying to build per-student columns that count activity attempts by progress level.
Sample data:

STUDENT_ID  STUDENT_ACTIVITY_SESSION_ID  NODE_NAME  ACTIVITY_NAME  prog_level
FredID      gobbledeegook1               Node1      MyActivity1    pass
FredID      gobbledeegook2               Node1      MyActivity1    pass
FredID      gobbledeegook3               Node2      MyActivity2    pass
JaniceID    gobbledeegook4               Node3      MyActivity3    stay
JaniceID    gobbledeegook5               Node3      MyActivity3    stay
JaniceID    gobbledeegook5               Node3      MyActivity3    fail

Desired output:

STUDENT_ID  attempts_pass  attempts_fail  attempts_stay
FredID      3
JaniceID                   1              2

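- I tried iterating over the prog_level values so that the column names are generated automatically. I want each row to be one student ID and each count to be its own column.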
import pandas as pd

def std_attempts_by_prog_level(df):
    # Count session IDs per prog_level within one student's rows
    dict_fields = {}
    df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
    for name, group in df_by_prog_level:
        x = group.count()
        dict_fields["attempts_" + name] = x
    return pd.Series(dict_fields)

df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()
Result:

   STUDENT_ID  level_1               0
0  Fred        attempts_cancel       104
1  Fred        attempts_fail         96
2  Fred        attempts_in_progress  30
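
For what it's worth, the long result above is one reshape away from the wide layout. A minimal sketch, assuming the sample rows from the question and that a missing combination may show up as NaN: group on both keys, count, then move prog_level into the columns with `unstack`. The names `long_counts` and `wide` are just illustrative.

```python
import pandas as pd

# Sample rows copied from the question's example table
df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID", "JaniceID"],
    "STUDENT_ACTIVITY_SESSION_ID": [
        "gobbledeegook1", "gobbledeegook2", "gobbledeegook3",
        "gobbledeegook4", "gobbledeegook5", "gobbledeegook5",
    ],
    "NODE_NAME": ["Node1", "Node1", "Node2", "Node3", "Node3", "Node3"],
    "ACTIVITY_NAME": ["MyActivity1", "MyActivity1", "MyActivity2",
                      "MyActivity3", "MyActivity3", "MyActivity3"],
    "prog_level": ["pass", "pass", "pass", "stay", "stay", "fail"],
})

# Long form: one count per (student, prog_level) pair
long_counts = (
    df.groupby(["STUDENT_ID", "prog_level"])["STUDENT_ACTIVITY_SESSION_ID"].count()
)

# Wide form: prog_level values become columns, prefixed automatically
wide = long_counts.unstack("prog_level").add_prefix("attempts_")
print(wide)
```

Combinations that never occur (for example FredID with fail) come out as NaN here; `wide.fillna(0)` would turn them into zeros if that is preferred.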
...so this would still need pivoting and fiddling, which is why I tried coming at it from the pivot side instead.
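- Pivot approach with manually named fields: the resulting MultiIndex columns don't let me easily merge this back with my other per-student metrics.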
df_temp = (
    df.groupby(['STUDENT_ID', 'prog_level'], as_index=False)['STUDENT_ACTIVITY_SESSION_ID']
    .count()
    .pivot(index='STUDENT_ID', columns='prog_level')
    .rename({'cancel': 'attempts_cancel', 'fail': 'attempts_fail',
             'in_progress': 'attempts_in_progress', 'pass': 'attempts_pass'}, axis=1)
)
print(df_temp.columns)
Result:

MultiIndex([('STUDENT_ACTIVITY_SESSION_ID',      'attempts_cancel'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_fail'),
            ('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_pass')],
           names=[None, 'prog_level'])
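
The two-level columns appear because `pivot` is called without `values=`, so the remaining column name is kept as an outer level. One possible way around it, sketched on the question's sample rows: pass `values=` explicitly, prefix the column names with `add_prefix` rather than a hand-written rename dict, and reset the index so STUDENT_ID is an ordinary column again. The `other_metrics` frame and its `total_minutes` column are hypothetical stand-ins for the other student metrics.

```python
import pandas as pd

# Sample rows copied from the question's example table (only the columns used here)
df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID", "JaniceID"],
    "STUDENT_ACTIVITY_SESSION_ID": [
        "gobbledeegook1", "gobbledeegook2", "gobbledeegook3",
        "gobbledeegook4", "gobbledeegook5", "gobbledeegook5",
    ],
    "prog_level": ["pass", "pass", "pass", "stay", "stay", "fail"],
})

counts = (
    df.groupby(["STUDENT_ID", "prog_level"], as_index=False)["STUDENT_ACTIVITY_SESSION_ID"]
    .count()
    # values= keeps the columns single-level (just the prog_level labels)
    .pivot(index="STUDENT_ID", columns="prog_level",
           values="STUDENT_ACTIVITY_SESSION_ID")
    .add_prefix("attempts_")        # pass -> attempts_pass, etc.
    .rename_axis(columns=None)      # drop the leftover 'prog_level' axis name
    .reset_index()                  # STUDENT_ID back to a regular column
)

# Hypothetical frame standing in for the other per-student metrics
other_metrics = pd.DataFrame({
    "STUDENT_ID": ["FredID", "JaniceID"],
    "total_minutes": [42, 17],
})

merged = other_metrics.merge(counts, on="STUDENT_ID", how="left")
print(merged)
```

With flat columns, the merge is an ordinary key join on STUDENT_ID; `pd.crosstab(df["STUDENT_ID"], df["prog_level"])` would be another route to the same counts, with zeros in place of NaN.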