Python Pandas DataFrame to sliding window

Posted on January 14

I have a Pandas DataFrame:

timestamp	A	B	C	D
2023-09-27 14:05:50	1	2	3	4
2023-09-27 14:05:51	5	6	7	8
2023-09-27 14:05:52	9	10	11	12
2023-09-27 14:05:53	13	14	15	16
2023-09-27 14:05:54	17	18	19	20
2023-09-27 14:05:55	21	22	23	24

To feed it to a Keras autoencoder, I need a windowed version of the data (for example, window=3):

timestamp	0	1	2	3	4	5	6	7	8	9	10	11
2023-09-27 14:05:50	1	2	3	4	5	6	7	8	9	10	11	12
2023-09-27 14:05:51	5	6	7	8	9	10	11	12	13	14	15	16
2023-09-27 14:05:52	9	10	11	12	13	14	15	16	17	18	19	20
2023-09-27 14:05:53	13	14	15	16	17	18	19	20	21	22	23	24

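I wrote a function, but I think I may be missing the point. I ran into problems later in the pipeline: processing my data took an incredible amount of time (> 10 hours, on a machine with 128 cores, plenty of RAM, and 32 GPUs).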

def makeWindowDataFrame(df, windowSize):
    table = []
    for window in df.rolling(window=windowSize):
        if len(window) >= windowSize:
            arr = []
            for el in window.iloc:
                arr.extend(el.to_numpy().reshape(-1))
            table.append(arr)
    longest = len(max(table, key=len))
    return pd.DataFrame(table, columns=[a for a in range(longest)])

Is there a simpler way to create this dataset? This operation is the longest-running step in my setup.

EDIT 1:

import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
def win(df, N):
    return pd.DataFrame(sliding_window_view(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1), index=df.index[:len(df)-N+1])

df = pd.DataFrame( {'timestamp': {28384: pd.Timestamp('2023-09-27 14:05:50'), 28385: pd.Timestamp('2023-09-27 14:05:52'), 28386: pd.Timestamp('2023-09-27 14:05:54'), 28387: pd.Timestamp('2023-09-27 14:05:56'), 28388: pd.Timestamp('2023-09-27 14:05:58')}, 'p1l4e0': {28384: 0.8869906663894653, 28385: 0.9212895035743713, 28386: 0.9084778428077698, 28387: 0.8959079384803772, 28388: 0.9066142439842224}, 'p1l4e1': {28384: 0.3119787573814392, 28385: 0.31039634346961975, 28386: 0.3139703571796417, 28387: 0.3119153082370758, 28388: 0.30586937069892883}, 'p1l4e2': {28384: 0.9320452809333801, 28385: 0.9452565312385559, 28386: 0.9435424208641052, 28387: 0.9356696605682373, 28388: 0.9325512647628784}, 'p1l4e3': {28384: 0.10841193050146103, 28385: 0.1134769469499588, 28386: 0.11245745420455933, 28387: 0.109752357006073, 28388: 0.10924666374921799}} )

win(df, 3)

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
28384	2023-09-27 14:05:50	0.886991	0.311979	0.932045	0.108412	2023-09-27 14:05:52	0.92129	0.310396	0.945257	0.113477	2023-09-27 14:05:54	0.908478	0.31397	0.943542
28385	2023-09-27 14:05:52	0.92129	0.310396	0.945257	0.113477	2023-09-27 14:05:54	0.908478	0.31397	0.943542	0.112457	2023-09-27 14:05:56	0.895908	0.311915	0.93567
28386	2023-09-27 14:05:54	0.908478	0.31397	0.943542	0.112457	2023-09-27 14:05:56	0.895908	0.311915	0.93567	0.109752	2023-09-27 14:05:58	0.906614	0.305869	0.932551

EDIT 2:

It looks like the index was not set. That would explain why it doesn't work.

df = df.set_index('timestamp')
df.head().to_dict('tight')

{
'index': [28384, 28385, 28386, 28387, 28388],
'columns': ['timestamp', 'p1l4e0', 'p1l4e1', 'p1l4e2', 'p1l4e3'],
'data': ...,
'index_names': [None],
'column_names': [None]
}
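The 'tight' dict above still lists timestamp among the columns and shows an integer index, which suggests the DataFrame being inspected had not actually been re-indexed. A minimal sketch of how one might check this, using a shortened frame with values rounded from the sample above (the variable names `indexed` and `tight` are illustrative, not from the original post):

```python
import pandas as pd

# Shortened, rounded version of the sample data (illustrative values only).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-09-27 14:05:50",
        "2023-09-27 14:05:52",
        "2023-09-27 14:05:54",
    ]),
    "p1l4e0": [0.887, 0.921, 0.908],
    "p1l4e1": [0.312, 0.310, 0.314],
})

# set_index returns a new DataFrame, so the result has to be assigned.
indexed = df.set_index("timestamp")

# Before: timestamp is an ordinary column and would end up inside every window.
print(list(df.columns))

# After: timestamp is the index and only the numeric columns remain.
print(indexed.index.name, list(indexed.columns))

# The 'tight' orientation reports the index name separately from the columns.
tight = indexed.to_dict("tight")
print(tight["index_names"], tight["columns"])
```

If `timestamp` still appears under 'columns' after this step, the frame passed to the windowing function is probably a stale copy from before the `set_index` call.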

EDIT 3:

After restarting the kernel, it works. In my tests on larger datasets, this solution was at least 1000x faster, in the range of 1k-10k times. Thanks.

timestamp	0	1	2	3	4	5	6	7	8	9	10	11
2023-09-27 14:05:50	1	2	3	4	5	6	7	8	9	10	11	12
2023-09-27 14:05:51	5	6	7	8	9	10	11	12	13	14	15	16
2023-09-27 14:05:52	9	10	11	12	13	14	15	16	17	18	19	20
2023-09-27 14:05:53	13	14	15	16	17	18	19	20	21	22	23	24

timestamp	0	1	2	3	4	5	6	7
2023-09-27 14:05:50	1	2	3	4	5	6	7	8
2023-09-27 14:05:51	5	6	7	8	9	10	11	12
2023-09-27 14:05:52	9	10	11	12	13	14	15	16
2023-09-27 14:05:53	13	14	15	16	17	18	19	20
2023-09-27 14:05:54	17	18	19	20	21	22	23	24

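# svw is assumed to be an alias for numpy's sliding_window_view
from numpy.lib.stride_tricks import sliding_window_view as svw

N = 3
cols = ['A', 'B', 'C', 'D']
out = (df[df.columns.difference(cols)].head(-N+1)
       .join(pd.DataFrame(svw(df[cols], N, axis=0)
                          .swapaxes(1, 2)
                          .reshape(len(df)-N+1, -1),
                          index=df.index[:len(df)-N+1])
            )
      )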
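The approach above can be tried end to end on the six-row example from the question. The following is a minimal sketch, assuming timestamp has already been set as the index; the helper name `make_windows` is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(df, n):
    """Flatten every run of n consecutive rows into one row (hypothetical helper)."""
    n_out = len(df) - n + 1
    # Windowing along axis 0 yields shape (n_out, n_columns, n); swapping the
    # last two axes orders the values row by row inside each window.
    values = (sliding_window_view(df.to_numpy(), n, axis=0)
              .swapaxes(1, 2)
              .reshape(n_out, -1))
    # Each window is labelled with the timestamp of its first row.
    return pd.DataFrame(values, index=df.index[:n_out])


# Rebuild the toy data from the question: values 1..24 in columns A-D.
df = pd.DataFrame(
    np.arange(1, 25).reshape(6, 4),
    columns=list("ABCD"),
    index=pd.date_range("2023-09-27 14:05:50", periods=6,
                        freq=pd.Timedelta(seconds=1), name="timestamp"),
)

out3 = make_windows(df, 3)  # 4 rows x 12 columns
out2 = make_windows(df, 2)  # 5 rows x 8 columns
print(out3)
print(out2)
```

With window=3 the six rows give four windows of twelve values each, and with window=2 they give five windows of eight values, matching the two tables shown above.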

Recommended answer

Timings (plots not preserved), all with N=3: on 6 rows, on 60k rows, on 1.2M rows
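A comparison of this kind could be reproduced with a small harness like the one below. This is only a sketch under stated assumptions: `loop_windows` is a simplified stand-in for the row-by-row function in the question, `strided_windows` follows the `sliding_window_view` approach, and `time_both` is a hypothetical helper. No timing results are claimed here; the numbers depend on the machine and the data size.

```python
import timeit

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def loop_windows(df, n):
    """Row-by-row construction, similar in spirit to the question's function."""
    table = []
    for window in df.rolling(window=n):
        # Iterating a Rolling object also yields the short leading windows.
        if len(window) >= n:
            table.append(window.to_numpy().reshape(-1))
    return pd.DataFrame(table)


def strided_windows(df, n):
    """Vectorised construction based on sliding_window_view."""
    n_out = len(df) - n + 1
    values = (sliding_window_view(df.to_numpy(), n, axis=0)
              .swapaxes(1, 2)
              .reshape(n_out, -1))
    return pd.DataFrame(values, index=df.index[:n_out])


def time_both(n_rows, n=3, repeat=3):
    """Return the best-of-repeat wall time in seconds for both variants."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((n_rows, 4)), columns=list("ABCD"))
    t_loop = min(timeit.repeat(lambda: loop_windows(df, n),
                               number=1, repeat=repeat))
    t_strided = min(timeit.repeat(lambda: strided_windows(df, n),
                                  number=1, repeat=repeat))
    return t_loop, t_strided


t_loop, t_strided = time_both(500)
print(f"loop: {t_loop:.4f}s  strided: {t_strided:.4f}s")
```

Raising `n_rows` towards the sizes listed above shows how the two variants scale; the loop version may take a long time at the largest size.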
