Python: how to add a new DataFrame column containing an incrementing integer for each group

Posted on May 18

Suppose I have the following DataFrame:

date          group    value
2022-11-01    1        4
2022-11-02    1        12
2022-11-03    1        14
2022-11-04    1        25
2021-11-01    2        9
2021-11-02    2        7
2019-10-01    3        40
2022-10-02    3        14

For each group, I want to create a new column holding an integer that increments with the date. For example, this is the desired output:

date          group    value    new_col
2022-11-01    1        4        0
2022-11-02    1        12       1
2022-11-03    1        14       2
2022-11-04    1        25       3
2021-11-01    2        9        0
2021-11-02    2        7        1
2019-10-01    3        40       0
2022-10-02    3        14       1

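As you can see, new_col is roughly np.arange(0, len(df['date'])+1), but I want it computed per group, and none of the groupby variants I tried seems to work for me.

Here is what I tried: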

df.groupby('group')['date'].apply(lambda x: np.arange(0, len(x)+1))

However, this is nowhere near what I want. I would appreciate it if someone could explain how to do this correctly.

Recommended answer

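Is there another way to solve this using np.arange(0, len(x)+1) and groupby?

I changed the data so that the difference is visible. GroupBy.rank follows the order of the values in the date column, so its output differs from the plain row counter returned by GroupBy.cumcount and from your approach rewritten with GroupBy.transform (note that it uses np.arange(len(x)), because len(x)+1 would produce one value too many for each group):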

print(df)
        date  group  value
0 2022-11-08      1      4
1 2022-11-07      1     12
2 2022-11-03      1     14
3 2022-11-04      1     25
4 2021-11-21      2      9
5 2021-11-02      2      7
6 2019-10-01      3     40
7 2022-10-02      3     14

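import numpy as np

# 0-based rank of each date within its group (follows date order, not row order)
df['new_col'] = df.groupby('group')['date'].rank(method='dense').sub(1).astype(int)

# 0-based position of each row within its group (follows current row order)
df['new_col1'] = df.groupby('group').cumcount()

# Same counter built with np.arange, one value per row of the group
df['new_col2'] = df.groupby('group')['date'].transform(lambda x: np.arange(len(x)))
print(df)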
        date  group  value  new_col  new_col1  new_col2
0 2022-11-08      1      4        3         0         0
1 2022-11-07      1     12        2         1         1
2 2022-11-03      1     14        0         2         2
3 2022-11-04      1     25        1         3         3
4 2021-11-21      2      9        1         0         0
5 2021-11-02      2      7        0         1         1
6 2019-10-01      3     40        0         0         0
7 2022-10-02      3     14        1         1         1
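
To reproduce the comparison above, here is a minimal, self-contained sketch. The DataFrame construction is an assumption for illustration (the post only shows the printed frame), and the names by_date and by_row are hypothetical. It shows that the rank-based column depends on the date values, while cumcount depends only on the row order:

```python
import pandas as pd

# Sample data as printed above; building it by hand is an assumption for illustration
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-11-08", "2022-11-07", "2022-11-03", "2022-11-04",
        "2021-11-21", "2021-11-02", "2019-10-01", "2022-10-02",
    ]),
    "group": [1, 1, 1, 1, 2, 2, 3, 3],
    "value": [4, 12, 14, 25, 9, 7, 40, 14],
})

# 0-based rank of each date within its group: determined by the date values
by_date = df.groupby("group")["date"].rank(method="dense").sub(1).astype(int)

# 0-based position of each row within its group: determined by row order only
by_row = df.groupby("group").cumcount()

print(by_date.tolist())  # [3, 2, 0, 1, 1, 0, 0, 1]
print(by_row.tolist())   # [0, 1, 2, 3, 0, 1, 0, 1]
```

If a group contains duplicate dates, rank with method='dense' gives the tied rows the same number, whereas cumcount always gives every row a distinct number.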

If you want all three approaches to produce the same output, sort by both columns first:

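# Sort so that row order within each group matches date order
df = df.sort_values(['group','date'])

# After sorting, all three columns agree (assuming dates are unique within a group)
df['new_col'] = df.groupby('group')['date'].rank(method='dense').sub(1).astype(int)

df['new_col1'] = df.groupby('group').cumcount()

df['new_col2'] = df.groupby('group')['date'].transform(lambda x: np.arange(len(x)))
print(df)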
        date  group  value  new_col  new_col1  new_col2
2 2022-11-03      1     14        0         0         0
3 2022-11-04      1     25        1         1         1
1 2022-11-07      1     12        2         2         2
0 2022-11-08      1      4        3         3         3
5 2021-11-02      2      7        0         0         0
4 2021-11-21      2      9        1         1         1
6 2019-10-01      3     40        0         0         0
7 2022-10-02      3     14        1         1         1
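
If the original row order should be kept, one possible variation (a sketch, not part of the answer above) is to compute the counter on a sorted copy and assign it back. This relies on pandas aligning the assigned Series on the index. The sample frame is rebuilt here by hand and the name counter is hypothetical:

```python
import pandas as pd

# Sample data in its original, unsorted row order (constructed by hand for illustration)
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-11-08", "2022-11-07", "2022-11-03", "2022-11-04",
        "2021-11-21", "2021-11-02", "2019-10-01", "2022-10-02",
    ]),
    "group": [1, 1, 1, 1, 2, 2, 3, 3],
    "value": [4, 12, 14, 25, 9, 7, 40, 14],
})

# Count rows within each group on a sorted copy; the result keeps the original index labels
counter = df.sort_values(["group", "date"]).groupby("group").cumcount()

# Column assignment aligns on the index, so df itself is never reordered
df["new_col"] = counter

print(df["new_col"].tolist())  # [3, 2, 0, 1, 1, 0, 0, 1]
```

The result matches the rank-based column for this data because every date is unique within its group.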