I have a dataset that looks like this:
ITEM  CITY      START_Y  START_W  FIRST_USE_Y  FIRST_USE_W  VALUE
A     NEW YORK  2023     30       2023         32           15000
A     LONDON    2024     2        2024         2            12000
A     LONDON    2024     2        2024         5            50000
B     NEW YORK  2023     49       2024         1            19540
B     MADRID    2023     10       2023         11           15444
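First, the rows need to be grouped by the combination of ITEM and CITY. Then, for each group, I want to resample weekly to get up to 5 data points, starting from the START_Y/START_W week, and fill the VALUE column with zeros wherever a FIRST_USE_Y and FIRST_USE_W combination has no value. START_W and FIRST_USE_W are week numbers within the year (values range from 1 to 52).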
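To make the intended resampling concrete, here is a minimal pure-Python sketch of the logic. It rests on a few assumptions: every year has exactly 52 weeks (as stated above), a group is identified by ITEM, CITY, START_Y and START_W, and several rows that fall in the same week are summed. The names `week_window` and `resample` are made up for illustration.

```python
WEEKS_PER_YEAR = 52  # assumption from the question: weeks run from 1 to 52
N_WEEKS = 5          # number of weekly data points wanted per group

SAMPLE = [
    ("A", "NEW YORK", 2023, 30, 2023, 32, 15000),
    ("A", "LONDON", 2024, 2, 2024, 2, 12000),
    ("A", "LONDON", 2024, 2, 2024, 5, 50000),
    ("B", "NEW YORK", 2023, 49, 2024, 1, 19540),
    ("B", "MADRID", 2023, 10, 2023, 11, 15444),
]


def week_window(start_y, start_w, n=N_WEEKS):
    """Return n consecutive (year, week) pairs starting at (start_y, start_w)."""
    pairs = []
    for offset in range(n):
        idx = (start_w - 1) + offset  # zero-based week index, may run past 51
        pairs.append((start_y + idx // WEEKS_PER_YEAR, idx % WEEKS_PER_YEAR + 1))
    return pairs


def resample(rows):
    """Expand every group to N_WEEKS weekly rows, using 0 for weeks with no value."""
    observed = {}  # (item, city, start_y, start_w, use_y, use_w) -> summed value
    groups = {}    # dict used as an ordered set of group keys
    for item, city, sy, sw, fy, fw, value in rows:
        key = (item, city, sy, sw)
        groups[key] = None
        full_key = key + (fy, fw)
        observed[full_key] = observed.get(full_key, 0) + value

    out = []
    for key in groups:
        _, _, sy, sw = key
        for use_y, use_w in week_window(sy, sw):
            out.append(key + (use_y, use_w, observed.get(key + (use_y, use_w), 0)))
    return out
```

Calling `resample(SAMPLE)` yields five rows per group, including the roll-over from week 52 of 2023 to week 1 of 2024 for B / NEW YORK. A first-use week that lies outside the five-week window is simply dropped in this sketch.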
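I tried pandas with for loops, and it works fine. But since this is a very large dataset with millions of rows, I have to use SQL (I am a beginner at it). This is the code I tried: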
WITH RECURSIVE weekly_intervals AS (
    SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
    FROM citywise_values
    UNION ALL
    SELECT start_w + INTERVAL 1 WEEK, end_w
    FROM weekly_intervals
    WHERE start_w + INTERVAL 1 WEEK <= end_w
),
filled_values AS (
    SELECT
        w.item,
        w.city,
        w.start_y,
        w.start_w,
        COALESCE(cv.value, 0) AS value
    FROM
        (SELECT
            item,
            city,
            start_y,
            start_w
        FROM
            citywise_values
        GROUP BY
            item, city) w
    LEFT JOIN
        citywise_values cv ON w.item = cv.item
        AND w.city = cv.city
        AND w.start_y = cv.start_y
        AND w.start_w = cv.start_w
)
SELECT
    item,
    city,
    start_y,
    start_w,
    COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM
    filled_values
RIGHT JOIN
    weekly_intervals
ON
    filled_values.start_w = weekly_intervals.start_w
ORDER BY
    item, city, start_y, start_w
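Then I tried a cross join and was able to produce the result, but only for a single item and city combination. I could not figure out how to handle the whole dataset.
I am not sure whether I have explained this well, so I am posting the desired output, which I created by hand.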
ITEM  CITY      START_Y  START_W  FIRST_USE_Y  FIRST_USE_W  VALUE
A     NEW YORK  2023     30       2023         30           0
A     NEW YORK  2023     30       2023         31           0
A     NEW YORK  2023     30       2023         32           15000
A     NEW YORK  2023     30       2023         33           0
A     NEW YORK  2023     30       2023         34           0
A     LONDON    2024     2        2024         2            12000
A     LONDON    2024     2        2024         3            0
A     LONDON    2024     2        2024         4            0
A     LONDON    2024     2        2024         5            50000
A     LONDON    2024     2        2024         6            0
B     NEW YORK  2023     49       2023         49           0
B     NEW YORK  2023     49       2023         50           0
B     NEW YORK  2023     49       2023         51           0
B     NEW YORK  2023     49       2023         52           0
B     NEW YORK  2023     49       2024         1            19540
B     MADRID    2023     10       2023         10           0
B     MADRID    2023     10       2023         11           15444
B     MADRID    2023     10       2023         12           0
B     MADRID    2023     10       2023         13           0
B     MADRID    2023     10       2023         14           0
Any help would be greatly appreciated.
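For reference, one possible set-based formulation of the whole-dataset case is sketched here. It is only a sketch under assumptions: it runs on SQLite through Python's `sqlite3` module so that it can be checked locally, it treats every year as having 52 weeks, it sums several rows that fall in the same week, and it relies on integer division (in MySQL, `DIV` would be needed in place of `/`). The CTE names `offsets`, `combos` and `grid` and the helper `build_demo_db` are made up for illustration; only the table `citywise_values` and its columns come from the post.

```python
import sqlite3

SAMPLE_ROWS = [
    ("A", "NEW YORK", 2023, 30, 2023, 32, 15000),
    ("A", "LONDON", 2024, 2, 2024, 2, 12000),
    ("A", "LONDON", 2024, 2, 2024, 5, 50000),
    ("B", "NEW YORK", 2023, 49, 2024, 1, 19540),
    ("B", "MADRID", 2023, 10, 2023, 11, 15444),
]

QUERY = """
WITH offsets(n) AS (
    -- the five weekly offsets to generate for every group
    SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
),
combos AS (
    -- one row per group and start week
    SELECT DISTINCT item, city, start_y, start_w
    FROM citywise_values
),
grid AS (
    -- cross join: every group gets all five weeks, rolling over after week 52
    SELECT c.item, c.city, c.start_y, c.start_w,
           c.start_y + (c.start_w - 1 + o.n) / 52 AS first_use_y,
           (c.start_w - 1 + o.n) % 52 + 1 AS first_use_w
    FROM combos c
    CROSS JOIN offsets o
)
SELECT g.item, g.city, g.start_y, g.start_w, g.first_use_y, g.first_use_w,
       COALESCE(SUM(cv.value), 0) AS value  -- 0 where no source row matches
FROM grid g
LEFT JOIN citywise_values cv
       ON cv.item = g.item
      AND cv.city = g.city
      AND cv.start_y = g.start_y
      AND cv.start_w = g.start_w
      AND cv.first_use_y = g.first_use_y
      AND cv.first_use_w = g.first_use_w
GROUP BY g.item, g.city, g.start_y, g.start_w, g.first_use_y, g.first_use_w
ORDER BY g.item, g.city, g.start_y, g.start_w, g.first_use_y, g.first_use_w
"""


def build_demo_db():
    """Create an in-memory SQLite database holding the sample rows from the post."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE citywise_values ("
        "item TEXT, city TEXT, start_y INTEGER, start_w INTEGER, "
        "first_use_y INTEGER, first_use_w INTEGER, value INTEGER)"
    )
    conn.executemany("INSERT INTO citywise_values VALUES (?, ?, ?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


if __name__ == "__main__":
    for row in build_demo_db().execute(QUERY):
        print(*row)
```

The idea is that the cross join builds the complete grid of weeks for all groups at once, and the left join then attaches the known values, so no per-group loop is needed. Rows come back ordered alphabetically by item and city rather than in the order of the hand-made output above.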