The equivalent of
df %>% group_by(col1) %>% summarize(col2_agg=max(col2), col3_agg=min(col3))
is
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'})
which returns
      col2  col3
col1
1        5    -5
2        9    -9
The returned object is a pandas.DataFrame with an index named col1 and columns named col2 and col3. By default, when you group your data, pandas sets the grouping column(s) as the index for efficient access and modification. However, if you don't want that, there are two ways to keep col1 as a column.
Pass as_index=False:
df.groupby('col1', as_index=False).agg({'col2': 'max', 'col3': 'min'})
Call reset_index:
df.groupby('col1').agg({'col2': 'max', 'col3': 'min'}).reset_index()
Both yield
col1  col2  col3
   1     5    -5
   2     9    -9
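To make the two options concrete, here is a minimal runnable sketch. The sample df is an assumption: it was reconstructed so that it reproduces the numbers shown in this answer, and it is not necessarily the original data.

```python
import pandas as pd

# Assumed sample data, reconstructed to match the outputs shown in this answer.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 0, 6, 7, 8, 9],
    'col3': [-1, -2, -3, -4, -5, 0, -6, -7, -8, -9],
})

spec = {'col2': 'max', 'col3': 'min'}

# Default behaviour: the grouping column becomes the index.
indexed = df.groupby('col1').agg(spec)

# Option 1: keep col1 as a regular column while grouping.
flat_a = df.groupby('col1', as_index=False).agg(spec)

# Option 2: group first, then move the index back into a column.
flat_b = df.groupby('col1').agg(spec).reset_index()

print(indexed)
print(flat_a)
print(flat_b)
```

Which one to use is mostly a matter of taste. as_index=False avoids building the index in the first place, while reset_index is handy when the grouped result already exists.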
You can also pass multiple functions to groupby.agg.
agg_df = df.groupby('col1').agg({'col2': ['max', 'min', 'std'],
                                 'col3': ['size', 'std', 'mean', 'max']})
This also returns a DataFrame, but now it has a MultiIndex for columns.
     col2                col3
      max min       std  size       std mean max
col1
1       5   1  1.581139     5  1.581139   -3  -1
2       9   0  3.535534     5  3.535534   -6   0
A MultiIndex is very handy for selection and grouping. Here are some examples:
agg_df['col2'] # select the second column
      max  min       std
col1
1       5    1  1.581139
2       9    0  3.535534
agg_df[('col2', 'max')] # select the maximum of the second column
Out:
col1
1    5
2    9
Name: (col2, max), dtype: int64
agg_df.xs('max', axis=1, level=1) # select the maximum of all columns
Out:
      col2  col3
col1
1        5    -1
2        9     0
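Earlier (before version 0.20.0), it was possible to use dictionaries to rename columns in the agg call. For example,
df.groupby('col1')['col2'].agg({'max_col2': 'max'})
returned the maximum of the second column as max_col2: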
      max_col2
col1
1            5
2            9
However, this was deprecated in favor of the rename method:
df.groupby('col1')['col2'].agg(['max']).rename(columns={'max': 'col2_max'})
      col2_max
col1
1            5
2            9
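Newer pandas versions (0.25 and later) also support named aggregation, which sets the output column names directly in the agg call and is arguably the closest match to dplyr's summarize. A minimal sketch, again assuming a sample df reconstructed to match the numbers in this answer:

```python
import pandas as pd

# Assumed sample data, reconstructed to match the outputs shown in this answer.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 0, 6, 7, 8, 9],
    'col3': [-1, -2, -3, -4, -5, 0, -6, -7, -8, -9],
})

# Each keyword is an output column name; each value is a
# (source column, aggregation function) pair.
summary = df.groupby('col1').agg(
    col2_agg=('col2', 'max'),
    col3_agg=('col3', 'min'),
)
print(summary)
```

With this form there is no MultiIndex to flatten and no separate rename step.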
This can get verbose for DataFrames like agg_df defined above. In that case, you can flatten the column levels by joining their names:
agg_df.columns = ['_'.join(col) for col in agg_df.columns]
      col2_max  col2_min  col2_std  col3_size  col3_std  col3_mean  col3_max
col1
1            5         1  1.581139          5  1.581139         -3        -1
2            9         0  3.535534          5  3.535534         -6         0
For operations like groupby().summarize(newcolumn=max(col2 * col3)), you can still use agg by first adding a new column with assign.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1').agg('max')
      col2  col3  new_col
col1
1        5    -1       -1
2        9     0        0
This returns the maximum for both the old and the new columns, but as always you can slice that.
df.assign(new_col=df.eval('col2 * col3')).groupby('col1')['new_col'].agg('max')
col1
1   -1
2    0
Name: new_col, dtype: int64
With groupby.apply this would be shorter:
df.groupby('col1').apply(lambda x: (x.col2 * x.col3).max())
col1
1   -1
2    0
dtype: int64
However, groupby.apply treats this as a custom function, so it is not vectorized. Up to now, the functions we passed to agg ('min', 'max', 'size', etc.) are vectorized; these strings are aliases for optimized functions. You can replace df.groupby('col1').agg('min') with df.groupby('col1').agg(min), df.groupby('col1').agg(np.min) or df.groupby('col1').min() and they will all execute the same function. You will not see the same efficiency when you use custom functions.
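A rough way to see the difference is to time both approaches on a larger frame. The sketch below uses synthetic data and hypothetical helper names (big, via_agg, via_apply) that are not part of the original answer. It first checks that both approaches give the same values, then prints timings, which will vary with data size, number of groups and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 100,000 rows, up to 1,000 groups.
rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    'col1': rng.integers(0, 1000, n),
    'col2': rng.integers(0, 10, n),
    'col3': rng.integers(-10, 0, n),
})

def via_agg():
    # Build the product column once, then use the optimized groupby max.
    product = big['col2'] * big['col3']
    return big.assign(new_col=product).groupby('col1')['new_col'].max()

def via_apply():
    # Custom Python function, called once per group.
    grouped = big.groupby('col1')[['col2', 'col3']]
    return grouped.apply(lambda g: (g['col2'] * g['col3']).max())

# Both approaches should produce the same values per group.
same = via_agg().tolist() == via_apply().tolist()
print('same result:', same)

print('assign + agg:', timeit.timeit(via_agg, number=3))
print('apply       :', timeit.timeit(via_apply, number=3))
```

Selecting col2 and col3 before calling apply keeps the grouping column out of the frames passed to the lambda.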
Finally, as of version 0.20, agg can be used on DataFrames directly, without having to group first. See examples here.
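A minimal sketch of agg called directly on a DataFrame, once more assuming a sample df reconstructed to match the numbers in this answer:

```python
import pandas as pd

# Assumed sample data, reconstructed to match the outputs shown in this answer.
df = pd.DataFrame({
    'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'col2': [1, 2, 3, 4, 5, 0, 6, 7, 8, 9],
    'col3': [-1, -2, -3, -4, -5, 0, -6, -7, -8, -9],
})

# A dict of column -> function returns a Series indexed by column name.
overall = df.agg({'col2': 'max', 'col3': 'min'})

# A list of functions returns a DataFrame with one row per function.
table = df[['col2', 'col3']].agg(['min', 'max'])

print(overall)
print(table)
```

Here the whole frame is treated as a single group, so the result has no col1 index.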