I have two DataFrames with a MultiIndex, named df_base and df_updates. I want to merge them into a single DataFrame while preserving the MultiIndex.
>>> import numpy as np
>>> import pandas as pd
>>> df_base = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): 100,
...             ("2019-01-01", "1002"): 100,
...             ("2019-01-01", "1003"): 100,
...             ("2019-01-02", "1001"): 100,
...             ("2019-01-02", "1002"): 100,
...             ("2019-01-02", "1003"): 100,
...             ("2019-01-03", "1001"): 100,
...             ("2019-01-03", "1002"): 100,
...             ("2019-01-03", "1003"): 100,
...         }
...     },
... )
>>> df_base.index.names = ["date", "id"]
>>> df_base.convert_dtypes()
                 price
date       id
2019-01-01 1001    100
           1002    100
           1003    100
2019-01-02 1001    100
           1002    100
           1003    100
2019-01-03 1001    100
           1002    100
           1003    100
>>>
>>> df_updates = pd.DataFrame(
...     {
...         "price": {
...             ("2019-01-01", "1001"): np.nan,
...             ("2019-01-01", "1002"): 99,
...             ("2019-01-01", "1003"): 99,
...             ("2019-01-01", "1004"): 100,
...         }
...     }
... )
>>> df_updates.index.names = ["date", "id"]
>>> df_updates.convert_dtypes()
                 price
date       id
2019-01-01 1001   <NA>
           1002     99
           1003     99
           1004    100
I want to combine them according to the following rules:
- if no new data is given (NaN), keep the old data
- if an index does not exist in the base DataFrame, append the new data
I tried using .join, but it raises an error:
>>> df_base.join(df_updates)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[48], line 21
...
ValueError: columns overlap but no suffix specified: Index(['price'], dtype='object')
Even if I add suffixes, that only makes the data messier (and requires yet another workaround).
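For reference, this is roughly what the suffixed join looks like (a minimal sketch on shrunken versions of the frames above; lsuffix/rsuffix and how="outer" are standard DataFrame.join parameters). The two price columns then still have to be coalesced by hand:

```python
import numpy as np
import pandas as pd

df_base = pd.DataFrame(
    {"price": {("2019-01-01", "1001"): 100, ("2019-01-01", "1002"): 100}}
)
df_base.index.names = ["date", "id"]

df_updates = pd.DataFrame(
    {"price": {("2019-01-01", "1001"): np.nan, ("2019-01-01", "1002"): 99}}
)
df_updates.index.names = ["date", "id"]

# join with suffixes keeps BOTH price columns: price_old and price_new
joined = df_base.join(df_updates, how="outer", lsuffix="_old", rsuffix="_new")

# ...so the columns must still be coalesced manually afterwards
joined["price"] = joined["price_new"].fillna(joined["price_old"])
print(joined)
```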
I also tried using .update, but the result does not include the new rows whose index is absent from the base index:
>>> df_base.update(df_updates)
>>> df_base
                 price
date       id
2019-01-01 1001  100.0
           1002   99.0
           1003   99.0
2019-01-02 1001  100.0
           1002  100.0
           1003  100.0
2019-01-03 1001  100.0
           1002  100.0
           1003  100.0
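This matches the documented behavior of DataFrame.update: it modifies the caller in place, aligns on the index, and silently drops any rows or columns that do not already exist in the caller. A minimal sketch on toy frames:

```python
import pandas as pd

base = pd.DataFrame({"price": [100, 100]}, index=["a", "b"])
upd = pd.DataFrame({"price": [99, 100]}, index=["b", "c"])

# "b" is overwritten in place; "c" is silently dropped because
# update never adds rows that are missing from the caller's index
base.update(upd)
print(base)
```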
Finally, I tried a somewhat "hacky" two-step operation:
>>> df_base.update(df_updates)
>>> df_base = df_updates.combine_first(df_base)
>>> df_base
                 price
date       id
2019-01-01 1001  100.0
           1002   99.0
           1003   99.0
           1004  100.0
2019-01-02 1001  100.0
           1002  100.0
           1003  100.0
2019-01-03 1001  100.0
           1002  100.0
           1003  100.0
This is the result I expect, but I'm not sure it is the best solution. I benchmarked it with %timeit and got:
>>> %timeit df_base.update(df_updates)
345 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df_updates.combine_first(df_base)
1.36 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
With larger data, the results are:
>>> %timeit df_base.update(df_updates)
2.38 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df_updates.combine_first(df_base)
9.65 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is this the best solution for my case, or is there a more efficient/better-optimized approach (ideally a one-liner pandas function)? Thanks!
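For what it's worth, on data shaped like this the initial update call appears to be redundant: combine_first alone already prefers non-NaN values from df_updates and falls back to df_base, so a NaN in the updates keeps the old value and rows with a new index are appended. A sketch of the single call on shrunken versions of the frames above:

```python
import numpy as np
import pandas as pd

df_base = pd.DataFrame(
    {"price": {("2019-01-01", "1001"): 100, ("2019-01-01", "1002"): 100}}
)
df_base.index.names = ["date", "id"]

df_updates = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): np.nan,  # NaN -> keep the old value
            ("2019-01-01", "1002"): 99,      # non-NaN -> overwrite
            ("2019-01-01", "1004"): 100,     # new index -> appended
        }
    }
)
df_updates.index.names = ["date", "id"]

# non-NaN values from df_updates win; everything else comes from df_base
result = df_updates.combine_first(df_base)
print(result)
```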
Edit 1: full code
import numpy as np
import pandas as pd
df_base = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): 100,
            ("2019-01-01", "1002"): 100,
            ("2019-01-01", "1003"): 100,
            ("2019-01-02", "1001"): 100,
            ("2019-01-02", "1002"): 100,
            ("2019-01-02", "1003"): 100,
            ("2019-01-03", "1001"): 100,
            ("2019-01-03", "1002"): 100,
            ("2019-01-03", "1003"): 100,
        }
    },
)
df_base.index.names = ["date", "id"]
df_base.convert_dtypes()
df_updates = pd.DataFrame(
    {
        "price": {
            ("2019-01-01", "1001"): np.nan,
            ("2019-01-01", "1002"): 99,
            ("2019-01-01", "1003"): 99,
            ("2019-01-01", "1004"): 100,
        }
    }
)
df_updates.index.names = ["date", "id"]
df_updates.convert_dtypes()
df_base.update(df_updates)
df_base = df_updates.combine_first(df_base)
df_base