下面有两种方法可以解决您的问题,一种是使用pandas,另一种是使用numpy(UPDATED反映了OP在关于基于连续分组的bin
个值进行装箱的 comments 中的澄清):
res = df.assign(bin_index = (df.bin != df.bin.shift()).cumsum())
dfAggs = res[['butter', 'bin', 'bin_index']].groupby(['bin', 'bin_index']).agg([min, max])
dfAggs.columns = dfAggs.columns.droplevel()
res = res.join(dfAggs, on=['bin', 'bin_index']).drop(columns='bin_index')
print("", "pandas:", res, sep="\n")
a = df.copy().to_numpy()
print("", "input as numpy 2d array", a, sep="\n")
bin_index = a[:,2:3] != np.concatenate((np.full((1, 1), np.NaN), a[:-1,2:3]), axis = 0)
bin_index = np.cumsum(bin_index)
bins = np.unique(bin_index)
aggs = np.empty((a.shape[0], 2))
for b in bins:
mask = bin_index==b
aggs[mask, :] = (a[mask, 1].min(), a[mask, 1].max())
res = np.concatenate((a, aggs), axis=1)
print("", "numpy:", res, sep="\n")
输出:
input as pandas dataframe
time butter bin
0 2022-07-04 17:33:45+00:00 1.041967 3
1 2022-07-04 17:34:00+00:00 1.041967 4
2 2022-07-04 17:34:15+00:00 1.041966 4
3 2022-07-04 17:34:30+00:00 1.041967 4
4 2022-07-04 17:34:45+00:00 1.041968 4
5 2022-07-04 17:35:00+00:00 1.041969 4
6 2022-07-04 17:35:15+00:00 1.041971 4
7 2022-07-04 17:35:30+00:00 1.041973 4
8 2022-07-04 17:35:45+00:00 1.041975 4
9 2022-07-04 17:36:00+00:00 1.041977 5
10 2022-07-04 17:36:15+00:00 1.041979 5
11 2022-07-04 17:36:30+00:00 1.041981 5
12 2022-07-04 17:36:45+00:00 1.041983 5
13 2022-07-04 17:37:00+00:00 1.041985 5
14 2022-07-04 17:37:15+00:00 1.041986 6
15 2022-07-04 17:37:30+00:00 1.041987 6
16 2022-07-04 17:37:45+00:00 1.041988 6
17 2022-07-04 17:38:00+00:00 1.041989 6
18 2022-07-04 17:38:15+00:00 1.041990 4
19 2022-07-04 17:38:30+00:00 1.041995 4
pandas:
time butter bin min max
0 2022-07-04 17:33:45+00:00 1.041967 3 1.041967 1.041967
1 2022-07-04 17:34:00+00:00 1.041967 4 1.041966 1.041975
2 2022-07-04 17:34:15+00:00 1.041966 4 1.041966 1.041975
3 2022-07-04 17:34:30+00:00 1.041967 4 1.041966 1.041975
4 2022-07-04 17:34:45+00:00 1.041968 4 1.041966 1.041975
5 2022-07-04 17:35:00+00:00 1.041969 4 1.041966 1.041975
6 2022-07-04 17:35:15+00:00 1.041971 4 1.041966 1.041975
7 2022-07-04 17:35:30+00:00 1.041973 4 1.041966 1.041975
8 2022-07-04 17:35:45+00:00 1.041975 4 1.041966 1.041975
9 2022-07-04 17:36:00+00:00 1.041977 5 1.041977 1.041985
10 2022-07-04 17:36:15+00:00 1.041979 5 1.041977 1.041985
11 2022-07-04 17:36:30+00:00 1.041981 5 1.041977 1.041985
12 2022-07-04 17:36:45+00:00 1.041983 5 1.041977 1.041985
13 2022-07-04 17:37:00+00:00 1.041985 5 1.041977 1.041985
14 2022-07-04 17:37:15+00:00 1.041986 6 1.041986 1.041989
15 2022-07-04 17:37:30+00:00 1.041987 6 1.041986 1.041989
16 2022-07-04 17:37:45+00:00 1.041988 6 1.041986 1.041989
17 2022-07-04 17:38:00+00:00 1.041989 6 1.041986 1.041989
18 2022-07-04 17:38:15+00:00 1.041990 4 1.041990 1.041995
19 2022-07-04 17:38:30+00:00 1.041995 4 1.041990 1.041995
input as numpy 2d array
[['2022-07-04 17:33:45+00:00' 1.041967 3]
['2022-07-04 17:34:00+00:00' 1.041967 4]
['2022-07-04 17:34:15+00:00' 1.041966 4]
['2022-07-04 17:34:30+00:00' 1.041967 4]
['2022-07-04 17:34:45+00:00' 1.041968 4]
['2022-07-04 17:35:00+00:00' 1.041969 4]
['2022-07-04 17:35:15+00:00' 1.041971 4]
['2022-07-04 17:35:30+00:00' 1.041973 4]
['2022-07-04 17:35:45+00:00' 1.041975 4]
['2022-07-04 17:36:00+00:00' 1.041977 5]
['2022-07-04 17:36:15+00:00' 1.041979 5]
['2022-07-04 17:36:30+00:00' 1.041981 5]
['2022-07-04 17:36:45+00:00' 1.041983 5]
['2022-07-04 17:37:00+00:00' 1.041985 5]
['2022-07-04 17:37:15+00:00' 1.041986 6]
['2022-07-04 17:37:30+00:00' 1.041987 6]
['2022-07-04 17:37:45+00:00' 1.041988 6]
['2022-07-04 17:38:00+00:00' 1.041989 6]
['2022-07-04 17:38:15+00:00' 1.04199 4]
['2022-07-04 17:38:30+00:00' 1.041995 4]]
numpy:
[['2022-07-04 17:33:45+00:00' 1.041967 3 1.041967 1.041967]
['2022-07-04 17:34:00+00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:15+00:00' 1.041966 4 1.041966 1.041975]
['2022-07-04 17:34:30+00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:45+00:00' 1.041968 4 1.041966 1.041975]
['2022-07-04 17:35:00+00:00' 1.041969 4 1.041966 1.041975]
['2022-07-04 17:35:15+00:00' 1.041971 4 1.041966 1.041975]
['2022-07-04 17:35:30+00:00' 1.041973 4 1.041966 1.041975]
['2022-07-04 17:35:45+00:00' 1.041975 4 1.041966 1.041975]
['2022-07-04 17:36:00+00:00' 1.041977 5 1.041977 1.041985]
['2022-07-04 17:36:15+00:00' 1.041979 5 1.041977 1.041985]
['2022-07-04 17:36:30+00:00' 1.041981 5 1.041977 1.041985]
['2022-07-04 17:36:45+00:00' 1.041983 5 1.041977 1.041985]
['2022-07-04 17:37:00+00:00' 1.041985 5 1.041977 1.041985]
['2022-07-04 17:37:15+00:00' 1.041986 6 1.041986 1.041989]
['2022-07-04 17:37:30+00:00' 1.041987 6 1.041986 1.041989]
['2022-07-04 17:37:45+00:00' 1.041988 6 1.041986 1.041989]
['2022-07-04 17:38:00+00:00' 1.041989 6 1.041986 1.041989]
['2022-07-04 17:38:15+00:00' 1.04199 4 1.04199 1.041995]
['2022-07-04 17:38:30+00:00' 1.041995 4 1.04199 1.041995]]
Pandas 说明:
- 创建
bin_index
列,该列检测bin
中的更改,并 for each 这样的行增加一个id值
- 使用
DataFrame.groupby()
基于bin_index
执行聚合(min
max
)
- 使用
DataFrame.join()
(对聚合数据帧aggs
进行预处理以删除其名为butter
的多索引的级别)向原始数据帧添加min
和max
列.
Numpy解释:
- 创建
bin_index
个数组,该数组检测bin
中的变化,并 for each 这样的行增加一个id值
- 准备
aggs
作为形状为a.shape[0], 2
的数组,用于接收输入数组a
中对应bin
值的min
和max
列
- 对
bin_index
中的每个唯一bin
值使用布尔掩码,在a
的butter
列的相应行上执行聚合,并将这两个值放置在aggs
列中,用于这些相同的行
- 使用
numpy.concatenate()
将a
和aggs
水平粘合在一起.