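Introduction
In Polars, I want to run fairly complex queries, and I would like to simplify them by splitting the operations into functions. Before I can do that, I need to understand how to pass multiple columns and variables to those functions.
Example data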
# Libraries
import polars as pl
from datetime import datetime

# Data
test_data = pl.DataFrame({
    "class": ['A', 'A', 'A', 'B', 'B', 'C'],
    "date": [datetime(2020, 1, 31), datetime(2020, 2, 28), datetime(2021, 1, 31),
             datetime(2022, 1, 31), datetime(2023, 2, 28),
             datetime(2020, 1, 31)],
    "status": [1, 0, 1, 0, 1, 0]
})
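The problem
For each group, I want to know whether the reference date falls in the same year-month as any value in the DataFrame's date column.
I would like to do something like this: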
# Some date
reference_date = datetime(2020, 1, 2)

# What I would expect the query to look like
(test_data
    .groupby("class")
    .agg([
        pl.col("status").count().alias("row_count"),  # just to show code that works
        pl.lit(reference_date).alias("reference_date"),
        pl.col(["date", "status"])
            .apply(lambda group: myfunc(group, reference_date))
            .alias("point_in_time_status")
    ])
)

# The desired output
pl.DataFrame({
    "class": ['A', 'B', 'C'],
    "reference_date": [datetime(2020, 1, 2), datetime(2020, 1, 2), datetime(2020, 1, 2)],
    "point_in_time_status": [1, 0, 0]
})
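But I can't find any way to operate on the groups like this. Some people suggest using pl.struct, but that only outputs some odd object with no columns or anything else to work with.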
Example of the same operation in R
# Loading library
library(tidyverse)

# Creating dataframe
df <- data.frame(
  date = c(as.Date("2020-01-31"),
           as.Date("2020-02-28"), as.Date("2021-01-31"),
           as.Date("2022-01-31"), as.Date("2023-02-28"),
           as.Date("2020-01-31")),
  status = c(1, 0, 1, 0, 1, 0),
  class = c("A", "A", "A", "B", "B", "C"))

# Finding status in overlapping months
ref_date <- as.Date("2020-01-02")
df %>%
  group_by(class) %>%  # unquoted column name; group_by("class") would group on a constant
  filter(format(date, "%Y-%m") == format(ref_date, "%Y-%m")) %>%
  filter(status == 1)