Python: create a cumulative sum column for each string value in a pandas column

Posted on February 20

I have the following pandas DataFrame:

import pandas as pd
import random

random.seed(42)

df = pd.DataFrame({'index': list(range(0, 10)),
                   'cluster': [random.choice(['S', 'C']) for _ in range(0, 10)]})


    index   cluster
0   0   S
1   1   S
2   2   C
3   3   S
4   4   S
5   5   S
6   6   S
7   7   S
8   8   C
9   9   S

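I would like to create new columns, one for each unique value of the cluster column (two in this example, S and C), each holding the cumulative count of that value's occurrences.

The output DataFrame should look like this: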

pd.DataFrame({'index': list(range(0,10)),
             'cluster': [random.choice(['S',  'C', ]) for l in range(0,10)],
             'cumulative_S': [1,2,2,3,4,5,6,7,7,8],
             'cumulative_C': [0,0,1,1,1,1,1,1,2,2]})

index   cluster cumulative_S    cumulative_C
0   0   S   1   0
1   1   S   2   0
2   2   C   2   1
3   3   S   3   1
4   4   S   4   1
5   5   S   5   1
6   6   S   6   1
7   7   S   7   1
8   8   C   7   2
9   9   S   8   2

How can I do this?

Recommended answer

Code

Here df is your input DataFrame.

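# One-hot encode 'cluster', take the running total of each indicator column,
# order the columns by first appearance, and prefix their names.
tmp = (pd.get_dummies(df['cluster'])
       .cumsum()[df['cluster'].unique()]
       .add_prefix('cumulative_')
)
# Attach the new columns to the original frame.
out = pd.concat([df, tmp], axis=1)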

Output

   index cluster  cumulative_S  cumulative_C
0      0       S             1             0
1      1       S             2             0
2      2       C             2             1
3      3       S             3             1
4      4       S             4             1
5      5       S             5             1
6      6       S             6             1
7      7       S             7             1
8      8       C             7             2
9      9       S             8             2
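
For comparison, the same result can be built without one-hot encoding by comparing the column against each unique value and taking the cumulative sum of the resulting boolean mask. This is a minimal sketch, not part of the original answer: the helper name add_cumulative_counts and its prefix parameter are hypothetical, and the cluster values are hard-coded from the table in the question so the example does not depend on the random seed.

```python
import pandas as pd

# Cluster values copied from the question's printed frame (an assumption made
# here so the example is reproducible without the random module).
df = pd.DataFrame({
    "index": list(range(10)),
    "cluster": ["S", "S", "C", "S", "S", "S", "S", "S", "C", "S"],
})


def add_cumulative_counts(frame, column, prefix="cumulative_"):
    """Return a copy of `frame` with one running-count column per unique value.

    Hypothetical helper for illustration only.
    """
    out = frame.copy()
    # unique() yields the values in order of first appearance.
    for value in frame[column].unique():
        # The boolean mask marks matching rows; its cumulative sum counts
        # how many matches have been seen up to and including each row.
        out[f"{prefix}{value}"] = (frame[column] == value).cumsum()
    return out


result = add_cumulative_counts(df, "cluster")
print(result)
```

The loop adds one column per unique value, so it may be easier to read when there are only a few categories. The get_dummies approach above does the same work in a single chained expression.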