I am trying to summarize/aggregate a data frame as follows. Although the code gives the correct result, it is very repetitive and I would like to avoid that. I think I need to use groupby, agg, apply, etc., but I cannot find a way to do it. The goal is to compute df_summ at the end. I suspect I am using too many intermediate data frames to select rows, and too many merge calls to put the results back together. I feel there must be a simpler way, but I cannot figure it out.
The real df_stats input data frame has millions of rows, and the df_summ output data frame has dozens of columns. The input shown below is just a minimal reproducible example.
import io
import pandas as pd
TESTDATA="""
enzyme regions N length
AaaI all 10 238045
AaaI all 20 170393
AaaI all 30 131782
AaaI all 40 103790
AaaI all 50 81246
AaaI all 60 62469
AaaI all 70 46080
AaaI all 80 31340
AaaI all 90 17188
AaaI captured 10 292735
AaaI captured 20 229824
AaaI captured 30 193605
AaaI captured 40 163710
AaaI captured 50 138271
AaaI captured 60 116122
AaaI captured 70 95615
AaaI captured 80 73317
AaaI captured 90 50316
AagI all 10 88337
AagI all 20 19144
AagI all 30 11030
AagI all 40 8093
AagI all 50 6394
AagI all 60 4991
AagI all 70 3813
AagI all 80 2759
AagI all 90 1666
AagI captured 10 34463
AagI captured 20 19220
AagI captured 30 15389
AagI captured 40 12818
AagI captured 50 10923
AagI captured 60 9261
AagI captured 70 7753
AagI captured 80 6201
AagI captured 90 4495
"""
df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')
df_cap_N90 = df_stats[(df_stats['N'] == 90) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_cap_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'captured')].drop(columns=['regions', 'N'])
df_all_N50 = df_stats[(df_stats['N'] == 50) & (df_stats['regions'] == 'all') ].drop(columns=['regions', 'N'])
df_summ_cap_N50_all_N50 = pd.merge(df_cap_N50, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N50', '_all_N50'))
df_summ_cap_N50_all_N50['cap_N50_all_N50'] = (df_summ_cap_N50_all_N50['length_cap_N50'] -
df_summ_cap_N50_all_N50['length_all_N50'])
print(df_summ_cap_N50_all_N50)
df_summ_cap_N90_all_N50 = pd.merge(df_cap_N90, df_all_N50, on='enzyme', how='inner', suffixes=('_cap_N90', '_all_N50'))
df_summ_cap_N90_all_N50['cap_N90_all_N50'] = df_summ_cap_N90_all_N50['length_cap_N90'] - df_summ_cap_N90_all_N50['length_all_N50']
print(df_summ_cap_N90_all_N50)
df_summ = pd.merge(df_summ_cap_N50_all_N50.drop(columns=['length_cap_N50', 'length_all_N50']),
df_summ_cap_N90_all_N50.drop(columns=['length_cap_N90', 'length_all_N50']),
on='enzyme', how='inner')
print(df_summ)
This prints:
enzyme length_cap_N50 length_all_N50 cap_N50_all_N50
0 AaaI 138271 81246 57025
1 AagI 10923 6394 4529
enzyme length_cap_N90 length_all_N50 cap_N90_all_N50
0 AaaI 50316 81246 -30930
1 AagI 4495 6394 -1899
enzyme cap_N50_all_N50 cap_N90_all_N50
0 AaaI 57025 -30930
1 AagI 4529 -1899
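For comparison, here is a sketch of how the same end result might be computed with a single pivot instead of repeated filtering and merging. This is an assumed alternative, not part of the question's original code, and it uses only the rows the example needs:

```python
import io
import pandas as pd

# Subset of the example input: only the (regions, N) pairs used by df_summ.
TESTDATA = """
enzyme regions N length
AaaI all 50 81246
AaaI captured 50 138271
AaaI captured 90 50316
AagI all 50 6394
AagI captured 50 10923
AagI captured 90 4495
"""

df_stats = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

# Reshape to one row per enzyme, with one column per (regions, N) pair.
wide = df_stats.pivot(index='enzyme', columns=['regions', 'N'], values='length')

# Each summary column is then plain column arithmetic on the wide table.
df_summ = pd.DataFrame({
    'cap_N50_all_N50': wide[('captured', 50)] - wide[('all', 50)],
    'cap_N90_all_N50': wide[('captured', 90)] - wide[('all', 50)],
}).reset_index()
print(df_summ)
```

With the full input, additional summary columns would just be more entries in the same dictionary, rather than more filter/merge pairs.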
Notes on the bioinformatics background behind this question:
(Feel free to skip this; it describes the domain knowledge behind the Python code.)
The code above is one step of a multi-step bioinformatics project in which I am trying to find the best restriction enzymes based on how they cut DNA.
As input to this step, I have a table of restriction enzymes (their names are stored in the enzyme column). I want to rank the enzymes by statistical properties of how they cut DNA. The regions column stores two different types of DNA regions that I want the enzymes to distinguish between. The N column is the name of a statistic (N10, ..., N90) that measures how finely the DNA is cut, and length is the value of that statistic. The N statistics summarize the distribution of DNA fragment lengths (measured in nucleotides) and are similar in spirit to quantiles (10%, ..., 90%). When comparing enzymes, I want to do simple arithmetic such as cap_N90_all_N50 = { captured N90 } - { all N50 }, and so on, and then rank the enzymes by a combination of cap_N50_all_N50 and similar columns.
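To make the ranking step concrete, here is a minimal sketch using the df_summ values printed above. The choice of sort key is an assumption for illustration; in practice any combination of the summary columns could serve as the ranking criterion:

```python
import pandas as pd

# Summary values taken directly from the df_summ output shown above.
df_summ = pd.DataFrame({
    'enzyme': ['AaaI', 'AagI'],
    'cap_N50_all_N50': [57025, 4529],
    'cap_N90_all_N50': [-30930, -1899],
})

# Rank enzymes by one derived statistic, largest (best) first.
ranked = df_summ.sort_values('cap_N50_all_N50', ascending=False).reset_index(drop=True)
print(ranked['enzyme'].tolist())
```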