Python 3.x: Is there a vectorized way to add missing months using resample?

Posted on July 8

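I am trying to add the missing months for each ID. Each added row should carry the ID and the year_month, with NaN for product. My code does this with apply(), but it is slow. I am looking for a vectorized version that runs faster.

Specifically, on my system, with 60,000 rows, the line df.set_index(df.index).groupby('ID').apply(add_missing_months) takes about 20 seconds. I plan to process millions of rows, so I think the operation needs to be vectorized. Any help is much appreciated!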

import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3], 'year_month': ['2020-01-01','2020-08-01','2020-10-01','2020-01-01','2020-07-01','2021-05-01'], 'product':['A','B','C','A','D','C']})

# Enlarge dataset to 60 000 rows
for i in range(9999):
    df2 = df.iloc[-6:].copy()
    df2['ID'] = df2['ID'] + 3
    df = pd.concat([df,df2], axis=0, ignore_index=True)

df['year_month'] = pd.to_datetime(df['year_month'])
df.index = df['year_month']  # already datetime64 after the conversion above
df = df.drop('year_month', axis = 1)

# The slow function
def add_missing_months(s):
    min_d = s.index.min()
    max_d = s.index.max()
    s = s.reindex(pd.date_range(min_d, max_d, freq='MS'))
    return s

df = df.set_index(df.index).groupby('ID').apply(add_missing_months)
df = df.drop('ID', axis = 1)
df = df.reset_index()

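# Vectorized alternative: work with monthly periods instead of reindexing each group.
# First and last month per ID, as monthly periods.
df1 = (df.assign(year_month=df['year_month'].dt.to_period('m'))
         .groupby(['ID'])['year_month']
         .agg(['min', 'max']))

# Number of months each ID spans, counting both endpoints.
diff = df1['max'].astype('int').sub(df1['min'].astype('int')) + 1

# Repeat each ID once per month, then step forward from its first month.
df1 = df1.loc[df1.index.repeat(diff)]
df1 = (df1['min'].add(df1.groupby(level=0).cumcount())
                 .dt.to_timestamp()
                 .reset_index(name='year_month'))

# Left-merge the original rows back in; months that were missing get NaN for product.
df = df1.merge(df.rename_axis(None), how='left')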

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'year_month': ['2020-01-01', '2020-08-01', '2020-10-01',
                                  '2020-01-01', '2020-07-01', '2021-05-01'],
                   'product': ['A', 'B', 'C', 'A', 'D', 'C']})

# Enlarge dataset to 60 000 rows
for i in range(9999):
    df2 = df.iloc[-6:].copy()
    df2['ID'] = df2['ID'] + 3
    df = pd.concat([df, df2], axis=0, ignore_index=True)

df['year_month'] = pd.to_datetime(df['year_month'])
df.index = pd.to_datetime(df['year_month'], format='%Y%m%d')

def jez(df):
    df1 = (df.assign(year_month=df['year_month'].dt.to_period('m'))
             .groupby(['ID'])['year_month']
             .agg(['min', 'max']))
    df1 = df1.loc[df1.index.repeat(
        df1['max'].astype('int').sub(df1['min'].astype('int')) + 1)]
    df1 = ((df1['min'] + df1.groupby(level=0).cumcount())
           .dt.to_timestamp()
           .reset_index(name='year_month'))
    return df1.merge(df.rename_axis(None), how='left')

def vogel(df):
    min_d = df['year_month'].min()
    max_d = df['year_month'].max()

    # generate all possible combinations of date and ID
    df_agg = df.groupby(['ID'])['year_month'].agg(['min', 'max'])
    grid = pd.DataFrame(
        index=pd.MultiIndex.from_product(
            [pd.date_range(min_d, max_d, freq='MS'), df_agg.index]
        )
    )

    # reduce to only relevant dates
    grid = grid.merge(df_agg, left_on='ID', right_index=True)
    grid = grid.reset_index().rename(columns={'level_0': 'year_month'})
    grid = grid[grid['year_month'].between(grid['min'], grid['max'])]
    grid = grid.drop(columns=['min', 'max'])

    # add product information from the original rows;
    # rename_axis(None) avoids a clash between the year_month index and column
    return grid.merge(df.rename_axis(None), how='left')
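Before timing any vectorized variant on the full 60,000 rows, it may be worth checking it against a plain per-ID loop on the six-row sample. The sketch below is one way to do that. It is not either of the functions above: the names fill_months_loop and fill_months_vectorized are hypothetical, and the vectorized version swaps the period arithmetic for integer month numbers (year * 12 + month - 1), on the assumption that year_month is a datetime column holding first-of-month dates.

```python
import numpy as np
import pandas as pd

# Six-row sample from the question, with year_month already parsed.
sample = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'year_month': pd.to_datetime(['2020-01-01', '2020-08-01', '2020-10-01',
                                  '2020-01-01', '2020-07-01', '2021-05-01']),
    'product': ['A', 'B', 'C', 'A', 'D', 'C'],
})

def fill_months_loop(df):
    """Reference result: reindex each ID onto its own monthly range."""
    parts = []
    for id_, g in df.groupby('ID'):
        months = pd.date_range(g['year_month'].min(), g['year_month'].max(), freq='MS')
        filled = g.set_index('year_month')[['product']].reindex(months)
        filled = filled.rename_axis('year_month').reset_index()
        filled.insert(0, 'ID', id_)
        parts.append(filled)
    return pd.concat(parts, ignore_index=True)

def fill_months_vectorized(df):
    """Build every (ID, month) pair with array arithmetic, then left-merge."""
    bounds = df.groupby('ID')['year_month'].agg(['min', 'max'])
    # Integer month number for the first and last month of each ID.
    first = (bounds['min'].dt.year * 12 + bounds['min'].dt.month - 1).to_numpy()
    last = (bounds['max'].dt.year * 12 + bounds['max'].dt.month - 1).to_numpy()
    counts = last - first + 1
    # Offset within each ID: 0, 1, ..., count - 1.
    offsets = np.arange(counts.sum()) - np.repeat(np.cumsum(counts) - counts, counts)
    months = np.repeat(first, counts) + offsets
    grid = pd.DataFrame({
        'ID': np.repeat(bounds.index.to_numpy(), counts),
        'year_month': pd.to_datetime(pd.DataFrame({
            'year': months // 12, 'month': months % 12 + 1, 'day': 1})),
    })
    # Months absent from df end up with NaN in product.
    return grid.merge(df, on=['ID', 'year_month'], how='left')

slow = fill_months_loop(sample)
fast = fill_months_vectorized(sample)
print(fast)
print('same months:', fast['year_month'].tolist() == slow['year_month'].tolist())
```

The loop version is only there as a reference to compare against. If the two results agree on the sample, the same comparison can be repeated on a larger slice before trusting timings on millions of rows.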
