Please review my code. It's not perfect (sorry, I ran out of time :D), but it can be a good starting point for you.
I was able to produce output similar to yours with a few functions:
sequence(start_date, end_date, interval 1 day) - I use this to generate one row for every day between start_date and end_date
F.date_trunc(timestamp="date", format="week")
- this finds the first day of the week; to get the last day I use something like (there's a quick check of this right after these steps):
F.date_sub(
    F.date_trunc(timestamp=F.date_add(F.col("date"), 7), format="week"), 1
)
Then a case/when trims the partial weeks, so the first week's start and the last week's end line up with start_date/end_date.
Once everything is ready, I simply group the rows to get the desired output.
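Side note: date_trunc with format="week" snaps to Monday 00:00, so the derived weeks run Monday through Sunday. Here is a minimal check of the boundary logic, assuming an active SparkSession named spark:

import datetime
import pyspark.sql.functions as F

# 2019-10-16 is a Wednesday; its week should run Mon 2019-10-14 .. Sun 2019-10-20
demo = spark.createDataFrame([(datetime.date(2019, 10, 16),)], ["date"])
demo.select(
    F.date_trunc(timestamp="date", format="week").alias("week_monday"),
    F.date_sub(
        F.date_trunc(timestamp=F.date_add(F.col("date"), 7), format="week"), 1
    ).alias("week_sunday"),
).show()
# +-------------------+-----------+
# |        week_monday|week_sunday|
# +-------------------+-----------+
# |2019-10-14 00:00:00| 2019-10-20|
# +-------------------+-----------+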
Example code:
import datetime
import pyspark.sql.functions as F
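# NOTE: an active SparkSession named `spark` is assumed (spark-shell or a notebook).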
data = [
("0001", "1111", datetime.date(2019, 10, 10), datetime.date(2019, 11, 6)),
("0002", "1112", datetime.date(2020, 11, 26), datetime.date(2021, 1, 6)),
("0003", "1113", datetime.date(2020, 9, 24), datetime.date(2020, 10, 21)),
]
df_1 = spark.createDataFrame(
data, schema=["client", "family_code", "start_date", "end_date"]
)
dfExplodedAndAligned = (
df_1.withColumn(
"date", F.explode(F.expr("sequence(start_date, end_date, interval 1 day)"))
)
.withColumn("start_of_week", F.date_trunc(timestamp="date", format="week"))
.withColumn(
"end_of_week",
F.date_sub(
F.date_trunc(timestamp=F.date_add(F.col("date"), 7), format="week"), 1
),
)
.withColumn(
"start_of_week",
F.when(
F.col("start_date") > F.col("start_of_week"), F.col("start_date")
).otherwise(F.col("start_of_week")),
)
.withColumn(
"end_of_week",
F.when(F.col("end_date") < F.col("end_of_week"), F.col("end_date")).otherwise(
F.col("end_of_week")
),
)
)
# The agg is probably not needed here; I wasn't able to run groupBy without an
# aggregate, so there may be something better (see the distinct() sketch below).
dfExplodedAndAligned.groupBy(
F.col("client"),
F.col("family_code"),
F.col("start_date"),
F.col("end_date"),
F.col("start_of_week"),
F.col("end_of_week"),
).agg(F.max("start_date")).orderBy("start_of_week").drop("max(start_date)").show()
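As the comment above says, the aggregate is only there to satisfy groupBy. A possibly cleaner alternative (a sketch, not verified against the run above) is to select the needed columns and deduplicate with distinct(), which drops the dummy aggregate entirely:

(
    dfExplodedAndAligned.select(
        "client", "family_code", "start_date", "end_date",
        "start_of_week", "end_of_week",
    )
    .distinct()
    .orderBy("start_of_week")
    .show()
)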
Output:
+------+-----------+----------+----------+-------------------+-----------+
|client|family_code|start_date| end_date| start_of_week|end_of_week|
+------+-----------+----------+----------+-------------------+-----------+
| 0001| 1111|2019-10-10|2019-11-06|2019-10-10 00:00:00| 2019-10-13|
| 0001| 1111|2019-10-10|2019-11-06|2019-10-14 00:00:00| 2019-10-20|
| 0001| 1111|2019-10-10|2019-11-06|2019-10-21 00:00:00| 2019-10-27|
| 0001| 1111|2019-10-10|2019-11-06|2019-10-28 00:00:00| 2019-11-03|
| 0001| 1111|2019-10-10|2019-11-06|2019-11-04 00:00:00| 2019-11-06|
| 0003| 1113|2020-09-24|2020-10-21|2020-09-24 00:00:00| 2020-09-27|
| 0003| 1113|2020-09-24|2020-10-21|2020-09-28 00:00:00| 2020-10-04|
| 0003| 1113|2020-09-24|2020-10-21|2020-10-05 00:00:00| 2020-10-11|
| 0003| 1113|2020-09-24|2020-10-21|2020-10-12 00:00:00| 2020-10-18|
| 0003| 1113|2020-09-24|2020-10-21|2020-10-19 00:00:00| 2020-10-21|
| 0002| 1112|2020-11-26|2021-01-06|2020-11-26 00:00:00| 2020-11-29|
| 0002| 1112|2020-11-26|2021-01-06|2020-11-30 00:00:00| 2020-12-06|
| 0002| 1112|2020-11-26|2021-01-06|2020-12-07 00:00:00| 2020-12-13|
| 0002| 1112|2020-11-26|2021-01-06|2020-12-14 00:00:00| 2020-12-20|
| 0002| 1112|2020-11-26|2021-01-06|2020-12-21 00:00:00| 2020-12-27|
| 0002| 1112|2020-11-26|2021-01-06|2020-12-28 00:00:00| 2021-01-03|
| 0002| 1112|2020-11-26|2021-01-06|2021-01-04 00:00:00| 2021-01-06|
+------+-----------+----------+----------+-------------------+-----------+
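One last cosmetic point: start_of_week comes back as a timestamp (date_trunc returns timestamps) while end_of_week is a date. If you want both as dates, a cast should do it, e.g. adding this step at the end of the chain (untested sketch):

.withColumn("start_of_week", F.col("start_of_week").cast("date"))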