I have a script that uses `np.digitize` to assign values like these to bins:

| time | butter | bin |
| :---- | :------ | :--- |
| 2022-07-04 17:33:45+00:00 | 1.041967 | 3 |
| 2022-07-04 17:34:00+00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:15+00:00 | 1.041966 | 4 |
| 2022-07-04 17:34:30+00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:45+00:00 | 1.041968 | 4 |
| 2022-07-04 17:35:00+00:00 | 1.041969 | 4 |
| 2022-07-04 17:35:15+00:00 | 1.041971 | 4 |
| 2022-07-04 17:35:30+00:00 | 1.041973 | 4 |
| 2022-07-04 17:35:45+00:00 | 1.041975 | 4 |
| 2022-07-04 17:36:00+00:00 | 1.041977 | 5 |
| 2022-07-04 17:36:15+00:00 | 1.041979 | 5 |
| 2022-07-04 17:36:30+00:00 | 1.041981 | 5 |
| 2022-07-04 17:36:45+00:00 | 1.041983 | 5 |
| 2022-07-04 17:37:00+00:00 | 1.041985 | 5 |
| 2022-07-04 17:37:15+00:00 | 1.041986 | 6 |
| 2022-07-04 17:37:30+00:00 | 1.041987 | 6 |
| 2022-07-04 17:37:45+00:00 | 1.041988 | 6 |
| 2022-07-04 17:38:00+00:00 | 1.041989 | 6 |

The bin values can increase or decrease from one row to the next, and they can skip bin numbers.
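For illustration, a minimal sketch of how a bin column with these properties can arise from `np.digitize`. The values and edges below are made up for the example and are not the ones used in the script.

```python
import numpy as np

# Hypothetical values and bin edges, chosen only for illustration.
values = np.array([0.5, 1.2, 1.4, 3.7, 3.9, 1.1])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# With the default right=False, np.digitize returns for each value the
# index i such that edges[i-1] <= value < edges[i].
bins = np.digitize(values, edges)
print(bins)  # [1 2 2 4 4 2]
```

Here the bin number jumps from 2 to 4 (skipping 3) and later drops back to 2, so the same bin value can occur in more than one separate run.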

How can I compute the local minimum/maximum for each bin, so that the result looks like this:

| time | butter | bin | min | max |
| :---- | :------ | :--- | :--- | :--- |
| 2022-07-04 17:33:45+00:00 | 1.041967 | 3 | 1.041967 | 1.041967 |
| 2022-07-04 17:34:00+00:00 | 1.041967 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:15+00:00 | 1.041966 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:30+00:00 | 1.041967 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:45+00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:00+00:00 | 1.041969 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:15+00:00 | 1.041971 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:30+00:00 | 1.041973 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:45+00:00 | 1.041975 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:36:00+00:00 | 1.041977 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:15+00:00 | 1.041979 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:30+00:00 | 1.041981 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:45+00:00 | 1.041983 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:00+00:00 | 1.041985 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:15+00:00 | 1.041986 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:30+00:00 | 1.041987 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:45+00:00 | 1.041988 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:38:00+00:00 | 1.041989 | 6 | 1.041986 | 1.041989 |

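I tried to solve this with `np.where` and `np.diff`, but I haven't found a solution, and I don't think that is the best way to approach the problem anyway.

Thanks in advance.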

Recommended answer

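Here are two ways to solve your problem, one using pandas and one using numpy (UPDATED to reflect OP's clarification in the comments that grouping should be based on consecutive runs of equal bin values):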

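# df is assumed to be the DataFrame printed under "input as pandas dataframe" in the output below.
res = df.assign(bin_index = (df.bin != df.bin.shift()).cumsum())  # id that increments at each change in bin

# Aggregate butter per run of consecutive equal bin values.
dfAggs = res[['butter', 'bin', 'bin_index']].groupby(['bin', 'bin_index']).agg(['min', 'max'])
dfAggs.columns = dfAggs.columns.droplevel()  # drop the 'butter' level, leaving 'min' and 'max'
res = res.join(dfAggs, on=['bin', 'bin_index']).drop(columns='bin_index')
print("", "pandas:", res, sep="\n")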

import numpy as np

a = df.copy().to_numpy()
print("", "input as numpy 2d array", a, sep="\n")
# True wherever the bin column (column 2) differs from the previous row; the NaN pad marks row 0 as a change.
bin_index = a[:,2:3] != np.concatenate((np.full((1, 1), np.nan), a[:-1,2:3]), axis = 0)
bin_index = np.cumsum(bin_index)  # run id per row (cumsum flattens to 1D)
bins = np.unique(bin_index)
aggs = np.empty((a.shape[0], 2))
for b in bins:
    mask = bin_index==b  # rows belonging to this run
    aggs[mask, :] = (a[mask, 1].min(), a[mask, 1].max())
res = np.concatenate((a, aggs), axis=1)
print("", "numpy:", res, sep="\n")

Output:

input as pandas dataframe
                         time    butter  bin
0   2022-07-04 17:33:45+00:00  1.041967    3
1   2022-07-04 17:34:00+00:00  1.041967    4
2   2022-07-04 17:34:15+00:00  1.041966    4
3   2022-07-04 17:34:30+00:00  1.041967    4
4   2022-07-04 17:34:45+00:00  1.041968    4
5   2022-07-04 17:35:00+00:00  1.041969    4
6   2022-07-04 17:35:15+00:00  1.041971    4
7   2022-07-04 17:35:30+00:00  1.041973    4
8   2022-07-04 17:35:45+00:00  1.041975    4
9   2022-07-04 17:36:00+00:00  1.041977    5
10  2022-07-04 17:36:15+00:00  1.041979    5
11  2022-07-04 17:36:30+00:00  1.041981    5
12  2022-07-04 17:36:45+00:00  1.041983    5
13  2022-07-04 17:37:00+00:00  1.041985    5
14  2022-07-04 17:37:15+00:00  1.041986    6
15  2022-07-04 17:37:30+00:00  1.041987    6
16  2022-07-04 17:37:45+00:00  1.041988    6
17  2022-07-04 17:38:00+00:00  1.041989    6
18  2022-07-04 17:38:15+00:00  1.041990    4
19  2022-07-04 17:38:30+00:00  1.041995    4

pandas:
                         time    butter  bin       min       max
0   2022-07-04 17:33:45+00:00  1.041967    3  1.041967  1.041967
1   2022-07-04 17:34:00+00:00  1.041967    4  1.041966  1.041975
2   2022-07-04 17:34:15+00:00  1.041966    4  1.041966  1.041975
3   2022-07-04 17:34:30+00:00  1.041967    4  1.041966  1.041975
4   2022-07-04 17:34:45+00:00  1.041968    4  1.041966  1.041975
5   2022-07-04 17:35:00+00:00  1.041969    4  1.041966  1.041975
6   2022-07-04 17:35:15+00:00  1.041971    4  1.041966  1.041975
7   2022-07-04 17:35:30+00:00  1.041973    4  1.041966  1.041975
8   2022-07-04 17:35:45+00:00  1.041975    4  1.041966  1.041975
9   2022-07-04 17:36:00+00:00  1.041977    5  1.041977  1.041985
10  2022-07-04 17:36:15+00:00  1.041979    5  1.041977  1.041985
11  2022-07-04 17:36:30+00:00  1.041981    5  1.041977  1.041985
12  2022-07-04 17:36:45+00:00  1.041983    5  1.041977  1.041985
13  2022-07-04 17:37:00+00:00  1.041985    5  1.041977  1.041985
14  2022-07-04 17:37:15+00:00  1.041986    6  1.041986  1.041989
15  2022-07-04 17:37:30+00:00  1.041987    6  1.041986  1.041989
16  2022-07-04 17:37:45+00:00  1.041988    6  1.041986  1.041989
17  2022-07-04 17:38:00+00:00  1.041989    6  1.041986  1.041989
18  2022-07-04 17:38:15+00:00  1.041990    4  1.041990  1.041995
19  2022-07-04 17:38:30+00:00  1.041995    4  1.041990  1.041995

input as numpy 2d array
[['2022-07-04 17:33:45+00:00' 1.041967 3]
 ['2022-07-04 17:34:00+00:00' 1.041967 4]
 ['2022-07-04 17:34:15+00:00' 1.041966 4]
 ['2022-07-04 17:34:30+00:00' 1.041967 4]
 ['2022-07-04 17:34:45+00:00' 1.041968 4]
 ['2022-07-04 17:35:00+00:00' 1.041969 4]
 ['2022-07-04 17:35:15+00:00' 1.041971 4]
 ['2022-07-04 17:35:30+00:00' 1.041973 4]
 ['2022-07-04 17:35:45+00:00' 1.041975 4]
 ['2022-07-04 17:36:00+00:00' 1.041977 5]
 ['2022-07-04 17:36:15+00:00' 1.041979 5]
 ['2022-07-04 17:36:30+00:00' 1.041981 5]
 ['2022-07-04 17:36:45+00:00' 1.041983 5]
 ['2022-07-04 17:37:00+00:00' 1.041985 5]
 ['2022-07-04 17:37:15+00:00' 1.041986 6]
 ['2022-07-04 17:37:30+00:00' 1.041987 6]
 ['2022-07-04 17:37:45+00:00' 1.041988 6]
 ['2022-07-04 17:38:00+00:00' 1.041989 6]
 ['2022-07-04 17:38:15+00:00' 1.04199 4]
 ['2022-07-04 17:38:30+00:00' 1.041995 4]]

numpy:
[['2022-07-04 17:33:45+00:00' 1.041967 3 1.041967 1.041967]
 ['2022-07-04 17:34:00+00:00' 1.041967 4 1.041966 1.041975]
 ['2022-07-04 17:34:15+00:00' 1.041966 4 1.041966 1.041975]
 ['2022-07-04 17:34:30+00:00' 1.041967 4 1.041966 1.041975]
 ['2022-07-04 17:34:45+00:00' 1.041968 4 1.041966 1.041975]
 ['2022-07-04 17:35:00+00:00' 1.041969 4 1.041966 1.041975]
 ['2022-07-04 17:35:15+00:00' 1.041971 4 1.041966 1.041975]
 ['2022-07-04 17:35:30+00:00' 1.041973 4 1.041966 1.041975]
 ['2022-07-04 17:35:45+00:00' 1.041975 4 1.041966 1.041975]
 ['2022-07-04 17:36:00+00:00' 1.041977 5 1.041977 1.041985]
 ['2022-07-04 17:36:15+00:00' 1.041979 5 1.041977 1.041985]
 ['2022-07-04 17:36:30+00:00' 1.041981 5 1.041977 1.041985]
 ['2022-07-04 17:36:45+00:00' 1.041983 5 1.041977 1.041985]
 ['2022-07-04 17:37:00+00:00' 1.041985 5 1.041977 1.041985]
 ['2022-07-04 17:37:15+00:00' 1.041986 6 1.041986 1.041989]
 ['2022-07-04 17:37:30+00:00' 1.041987 6 1.041986 1.041989]
 ['2022-07-04 17:37:45+00:00' 1.041988 6 1.041986 1.041989]
 ['2022-07-04 17:38:00+00:00' 1.041989 6 1.041986 1.041989]
 ['2022-07-04 17:38:15+00:00' 1.04199 4 1.04199 1.041995]
 ['2022-07-04 17:38:30+00:00' 1.041995 4 1.04199 1.041995]]

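Pandas explanation:

  • Create a bin_index column that detects changes in bin and increments an id value for each such row
  • Use DataFrame.groupby() on bin and bin_index to perform the aggregations (min, max)
  • Use DataFrame.join() (after prepping the aggregate dataframe dfAggs by dropping the level of its column MultiIndex named butter) to add the min and max columns to the original dataframe.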
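As a variation on the steps above (not the code from this answer), the groupby and join can be collapsed into `GroupBy.transform`, which broadcasts each group's aggregate back onto its rows. This is a minimal sketch on a small made-up frame; the data and the name `run_id` are assumptions for illustration.

```python
import pandas as pd

# Small made-up frame; bin 4 appears in two separate runs.
df = pd.DataFrame({
    "butter": [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0],
    "bin":    [3,   4,   4,   5,   5,   4,   4],
})

# A new run starts wherever bin differs from the previous row,
# so the cumulative sum of those changes labels each run.
run_id = (df["bin"] != df["bin"].shift()).cumsum()

# transform returns one value per input row, aligned with df's index.
grouped = df.groupby(run_id)["butter"]
res = df.assign(min=grouped.transform("min"), max=grouped.transform("max"))
print(res)
```

Because the two runs of bin 4 get different `run_id` values, they are aggregated separately, which matches the consecutive-grouping behaviour described above.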

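Numpy explanation:

  • Create a bin_index array that detects changes in bin and increments an id value for each such row
  • Prepare aggs as an array of shape (a.shape[0], 2) to receive the min and max for the matching rows of the input array a
  • For each unique value in bin_index, use a boolean mask to perform the aggregations on the matching rows of the butter column of a, and place the two values in the columns of aggs for those same rows
  • Use numpy.concatenate() to glue a and aggs together horizontally.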
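If the Python loop over runs becomes a bottleneck, one possible alternative (a different technique from the masked loop above) is `ufunc.reduceat`, which reduces contiguous slices in a single call. The sketch below assumes a non-empty input and works on separate numeric arrays rather than the mixed-type 2D array a; the sample data is made up.

```python
import numpy as np

butter = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0])
bins = np.array([3, 4, 4, 5, 5, 4, 4])

# Index of the first row of each run of equal consecutive bin values.
starts = np.concatenate(([0], np.flatnonzero(np.diff(bins)) + 1))
# Length of each run, used to repeat the per-run result back onto its rows.
lengths = np.diff(np.concatenate((starts, [len(bins)])))

# reduceat reduces each slice butter[starts[i]:starts[i+1]]
# (the last slice runs to the end of the array).
run_min = np.minimum.reduceat(butter, starts)
run_max = np.maximum.reduceat(butter, starts)

# One (min, max) pair per original row.
aggs = np.column_stack((np.repeat(run_min, lengths), np.repeat(run_max, lengths)))
print(aggs)
```

The resulting aggs has the same shape and meaning as in the loop version, so it could be concatenated to the input in the same way.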
