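I'm having trouble making the transition from R's data.table to Pandas for data munging.
Specifically, I am trying to assign the results of an aggregation back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform()
is the right approach.
To illustrate, I'm trying to replicate what I would do in R with: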
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr', `:=`(vwap_ish = sum(hours * price),
                       avg_id = mean(id)),
   .(node, term)]
df
#     id term node hours price vwap_ish avg_id
#  1:  1  qtr    A   300   107    90600      2
#  2:  2  qtr    A   300   104    90600      2
#  3:  3  qtr    A   300    91    90600      2
#  4:  4  qtr    B   300    89    95400      5
#  5:  5  qtr    B   300   113    95400      5
#  6:  6  qtr    B   300   116    95400      5
#  7:  7  mth    A    50   110       NA     NA
#  8:  8  mth    A   100   119       NA     NA
#  9:  9  mth    A   150    99       NA     NA
# 10: 10  mth    B    50   111       NA     NA
# 11: 11  mth    B   100   106       NA     NA
# 12: 12  mth    B   150   108       NA     NA
Using Pandas, I can create an object from df that contains all rows of the original df, together with the aggregations:
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df1
"""
              id  term  node  hours  price  vwap_ish  avg_id
node term
B    mth  9   10   mth     B     50    111     32350    10.0
          10  11   mth     B    100    106     32350    10.0
          11  12   mth     B    150    108     32350    10.0
     qtr  3    4   qtr     B    300     89     95400     4.0
          4    5   qtr     B    300    113     95400     4.0
          5    6   qtr     B    300    116     95400     4.0
A    mth  6    7   mth     A     50    110     32250     7.0
          7    8   mth     A    100    119     32250     7.0
          8    9   mth     A    150     99     32250     7.0
     qtr  0    1   qtr     A    300    107     90600     1.0
          1    2   qtr     A    300    104     90600     1.0
          2    3   qtr     A    300     91     90600     1.0
"""
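This doesn't really get me what I want, because a) it re-orders the rows and creates extra index levels, and b) it has calculated the aggregates for all rows, not just the 'qtr' ones.
I can get the subset easily enough with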
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df2
"""
              id  term  node  hours  price  vwap_ish  avg_id
node term
A    qtr  0    1   qtr     A    300    107     90600     1.0
          1    2   qtr     A    300    104     90600     1.0
          2    3   qtr     A    300     91     90600     1.0
B    qtr  3    4   qtr     B    300     89     95400     4.0
          4    5   qtr     B    300    113     95400     4.0
          5    6   qtr     B    300    116     95400     4.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
I also tried a few variations of merge and join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
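Both merge errors seem to come from df2 carrying node and term as index levels and as columns at the same time. A minimal sketch of one way around that, assuming it is acceptable to skip apply altogether: aggregate the 'qtr' rows down to one row per (node, term) group and left-merge that small table back on the two keys. The names csv_text, qtr, hp, agg and merged are made up for this illustration and do not appear above.

```python
import io

import pandas as pd

csv_text = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows that should receive aggregates.
qtr = df[df['term'] == 'qtr']

# One row per (node, term) group; as_index=False keeps the keys as
# ordinary columns, so there is no index/column ambiguity later.
# 'hp' is a hypothetical helper column holding hours * price.
agg = (
    qtr.assign(hp=qtr['hours'] * qtr['price'])
       .groupby(['node', 'term'], as_index=False)
       .agg(vwap_ish=('hp', 'sum'), avg_id=('id', 'mean'))
)

# A left merge keeps every row of df in its original order;
# groups absent from agg (the 'mth' rows) get NaN.
merged = df.merge(agg, how='left', on=['node', 'term'])
print(merged)
```

Because the 'mth' rows receive NaN, vwap_ish comes back as a float column rather than an integer one.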
In writing this I realised I could take my first approach and then just do
df1.loc[df1.term != 'qtr', ['vwap_ish','avg_id']] = np.nan
but this seems quite hacky. It means I have to use a new column, rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break, say for term == 'mth', then that would be problematic too.
I'd really appreciate any help with this, as it's been a very steep learning curve trying to make the transition from data.table to Pandas, and there's so much I would do in a one-liner that is taking me hours to figure out.
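For reference, here is a minimal sketch of one direction that might get close to the data.table one-liner. It assumes that transform becomes usable once the two-column expression (hours * price) is computed first, so that each transform call only sees a single Series. The names csv_text, mask, sub and keys are invented for this illustration.

```python
import io

import pandas as pd

csv_text = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows that should receive the aggregates (the data.table "i" part).
mask = df['term'] == 'qtr'
sub = df.loc[mask]

# Grouping keys taken from the filtered rows (the "by" part).
keys = [sub['node'], sub['term']]

# transform returns one value per input row, indexed like sub, so the
# result aligns with the masked rows of df; all other rows stay NaN.
df.loc[mask, 'vwap_ish'] = (sub['hours'] * sub['price']).groupby(keys).transform('sum')
df.loc[mask, 'avg_id'] = sub['id'].groupby(keys).transform('mean')
print(df)
```

This leaves the row order and index of df untouched, and the same .loc assignment would overwrite an existing column on the filtered rows only, instead of requiring a new one.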