我想计算一个金融数据时间序列的滚动"1m"统计数据.假设每个月的计算行数不会总是相等,除非您有足够的数据来划分月份,这是不常见的.

我try 分配一个列window_index来跟踪计算中包含的行,因为我将使用.rolling().over()表达式来计算每个窗口上的统计数据.我希望从最近的日期开始添加window_index整数,并以其方式返回.

Here is an image to help explain: Stock Data

目前,我正在try 做的是将红笔和蓝笔所表示的window_index按组相加.虽然我希望数据按黄色钢笔标记进行分组.数据集在2023-02-07结束,一个月前将是2023-01-07,或最接近的值2023-01-06.

这是我用来实现这一点的代码,但我不确定如何获得我想要的分组窗口.

df_window_index = (
    data.group_by_dynamic(
        index_column="date", every="1m", by="symbol"
    )
    .agg()
    .with_columns(
        pl.int_range(0, pl.len()).over("symbol").alias("window_index")
    )
)
data = data.join_asof(df_window_index, on="date", by="symbol").sort(
    "symbol"
)

使用timedelta而不是字符串似乎并不能解决问题.

左边是上面的代码,右边的图像是当我赚every=timedelta(days=31)的时候.我还是不明白为什么北极星要把那些日期改成31天的时间差.

Trying TimeDelta

任何帮助都将不胜感激,或者任何能给我指明正确方向的小贴士!谢谢!

生效日期:

df = pl.read_csv(b"""
date,open,high,low,close,volume,dividends,stock_splits,symbol,window_index
2021-01-04T00:00:00.000000000,133.52,133.61,126.76,129.41,143301900,0.0,0.0,AAPL,0
2021-01-05T00:00:00.000000000,128.89,131.74,128.43,131.01,97664900,0.0,0.0,AAPL,0
2021-01-06T00:00:00.000000000,127.72,131.05,126.38,126.6,155088000,0.0,0.0,AAPL,0
2021-01-07T00:00:00.000000000,128.36,131.63,127.86,130.92,109578200,0.0,0.0,AAPL,1
2021-01-08T00:00:00.000000000,132.43,132.63,130.23,132.05,105158200,0.0,0.0,AAPL,1
2021-01-11T00:00:00.000000000,129.19,130.17,128.5,128.98,100384500,0.0,0.0,AAPL,1
2021-01-12T00:00:00.000000000,128.5,129.69,126.86,128.8,91951100,0.0,0.0,AAPL,1
2021-01-13T00:00:00.000000000,128.76,131.45,128.49,130.89,88636800,0.0,0.0,AAPL,1
2021-01-14T00:00:00.000000000,130.8,131.0,128.76,128.91,90221800,0.0,0.0,AAPL,1
2021-01-15T00:00:00.000000000,128.78,130.22,127.0,127.14,111598500,0.0,0.0,AAPL,1
2021-01-19T00:00:00.000000000,127.78,128.71,126.94,127.83,90757300,0.0,0.0,AAPL,1
2021-01-20T00:00:00.000000000,128.66,132.49,128.55,132.03,104319500,0.0,0.0,AAPL,1
2021-01-21T00:00:00.000000000,133.8,139.67,133.59,136.87,120150900,0.0,0.0,AAPL,1
2021-01-22T00:00:00.000000000,136.28,139.85,135.02,139.07,114459400,0.0,0.0,AAPL,1
2021-01-25T00:00:00.000000000,143.07,145.09,136.54,142.92,157611700,0.0,0.0,AAPL,1
2021-01-26T00:00:00.000000000,143.6,144.3,141.37,143.16,98390600,0.0,0.0,AAPL,1
2021-01-27T00:00:00.000000000,143.43,144.3,140.41,142.06,140843800,0.0,0.0,AAPL,1
2021-01-28T00:00:00.000000000,139.52,141.99,136.7,137.09,142621100,0.0,0.0,AAPL,1
2021-01-29T00:00:00.000000000,135.83,136.74,130.21,131.96,177523800,0.0,0.0,AAPL,1
2021-02-01T00:00:00.000000000,133.75,135.38,130.93,134.14,106239800,0.0,0.0,AAPL,1
2021-02-02T00:00:00.000000000,135.73,136.31,134.61,134.99,83305400,0.0,0.0,AAPL,1
2021-02-03T00:00:00.000000000,135.76,135.77,133.61,133.94,89880900,0.0,0.0,AAPL,1
2021-02-04T00:00:00.000000000,136.3,137.4,134.59,137.39,84183100,0.0,0.0,AAPL,1
2021-02-05T00:00:00.000000000,137.35,137.42,135.86,136.76,75693800,0.2,0.0,AAPL,1
""".strip(), try_parse_dates=True)

推荐答案

要标记非重叠的k个月时间窗口,从每个组中的最新日期开始并向后返回,您可以使用以下助手函数.

def create_monthly_window_index(col: str, k: int = 1):
    year_diff = pl.col(col).last().dt.year() - pl.col(col).dt.year()
    month_diff = pl.col(col).last().dt.month() - pl.col(col).dt.month()
    day_indicator = pl.col(col).dt.day() > pl.col(col).last().dt.day()
    return (12 * year_diff + month_diff - day_indicator) // k

它可以按如下方式使用.

df.with_columns(create_monthly_window_index("date", k=1).alias("window_index").over("group"))

请注意,任意时间窗口的一般情况更加复杂(如我的 comments 中所概述的).

Python相关问答推荐

pandas DataFrame GroupBy.diff函数的意外输出

Python json.转储包含一些UTF-8字符的二元组,要么失败,要么转换它们.我希望编码字符按原样保留

基于字符串匹配条件合并两个帧

无法使用DBFS File API路径附加到CSV In Datricks(OSError Errno 95操作不支持)

实现自定义QWidgets作为QTimeEdit的弹出窗口

有没有一种ONE—LINER的方法给一个框架的每一行一个由整数和字符串组成的唯一id?

删除marplotlib条形图上的底边

名为__main__. py的Python模块在导入时不运行'

使用BeautifulSoup抓取所有链接

使用Python从rotowire中抓取MLB每日阵容

Matplotlib中的字体权重

跳过嵌套JSON中的级别并转换为Pandas Rame

如何使用使用来自其他列的值的公式更新一个rabrame列?

Odoo16:模板中使用的docs变量在哪里定义?

使用类型提示进行类型转换

为用户输入的整数查找根/幂整数对的Python练习

python的文件. truncate()意外地没有截断'

ModuleNotFoundError:Python中没有名为google的模块''

设置索引值每隔17行左右更改的索引

从列表中分离数据的最佳方式