Python 使用 Pandas 进行棘手的 max min groupby 转换

发布于06月07日

我希望按ID分组，并找到每个月的最小和最大计数.

Data个

DATE        ID  name    
4/30/2023   AA  hi  
4/5/2023    AA  hi  
4/1/2023    AA  hi  
4/1/2023    AA  hello   
4/30/2023   AA  hello   
4/5/2023    AA  hello   
4/5/2023    AA  hey 
4/30/2023   AA  hey 
4/5/2023    AA  ok  
4/30/2023   AA  ok  
4/30/2023   AA  ok  
5/1/2023    AA  ok
5/1/2023    AA  hey
5/25/2023   AA  hi
4/1/2023    BB  hey 
4/2/2023    BB  hi  
4/2/2023    BB  hello

Desired个

ID  DATE        stat    count
AA  4/1/2023    min     2
AA  4/30/2023   max     5
AA  5/25/2023   min     1
AA  5/1/2023    max     2
BB  4/1/2023    min     1
BB  4/2/2023    max     2

Doing个

result = df.groupby(['ID', 'DATE', 'name']).size().reset_index(name='count')
result['stat'] = result.groupby(['ID', 'DATE'])['count'].transform(lambda x: 'min' if x.idxmin() == x.idxmax() else 'max')

然而，这并没有说明日期.如有任何建议，欢迎光临.

推荐答案

这应该会起作用(双关语；])

import pandas as pd
import numpy as np

data = [["4/30/2023", "AA", "hi"]] # Fill rest of data, this was for testing
df = pd.DataFrame(data, columns=['DATE', 'ID', 'name'])

# Convert 'DATE' column to datetime format and extract month and year
df['DATE'] = pd.to_datetime(df['DATE'])
df['month'] = df['DATE'].dt.month
df['year'] = df['DATE'].dt.year

# Group by 'ID', 'month', and 'year' and calculate the count of names
result = df.groupby(['ID', 'year', 'month', 'DATE'])['name'].size().reset_index(name='count')

# Find the min and max counts for each ID and month combination
result_min = result.groupby(['ID', 'year', 'month'])['count'].min().reset_index(name='min_count')
result_max = result.groupby(['ID', 'year', 'month'])['count'].max().reset_index(name='max_count')

# Merge the min and max counts with the original result DataFrame
result = result.merge(result_min, on=['ID', 'year', 'month']).merge(result_max, on=['ID', 'year', 'month'])

# Create a 'stat' column based on the min and max counts
result['stat'] = np.where(result['count'] == result['min_count'], 'min', 'max')

# Drop unnecessary columns and reset index
result = result.drop(columns=['min_count', 'max_count']).reset_index(drop=True)

# We use .to_string() so we can remove the pandas indexing on the left of the print
print(result[['ID', 'DATE', 'stat', 'count']].to_string(index=False))