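I'm having a hard time making the transition from R's data.table to Pandas for data munging.

Specifically, I am trying to assign aggregated results back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.

To illustrate, I'm trying to replicate what I would do in R with: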

library(data.table)

df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))

df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
                        avg_id = mean(id) ),
    .(node, term)]

df

# id term node hours price vwap_ish avg_id
# 1:  1  qtr    A   300   107    90600      2
# 2:  2  qtr    A   300   104    90600      2
# 3:  3  qtr    A   300    91    90600      2
# 4:  4  qtr    B   300    89    95400      5
# 5:  5  qtr    B   300   113    95400      5
# 6:  6  qtr    B   300   116    95400      5
# 7:  7  mth    A    50   110       NA     NA
# 8:  8  mth    A   100   119       NA     NA
# 9:  9  mth    A   150    99       NA     NA
# 10: 10  mth    B    50   111       NA     NA
# 11: 11  mth    B   100   106       NA     NA
# 12: 12  mth    B   150   108       NA     NA

Using Pandas, I can create an object from df that contains all rows of the original df, together with the aggregations:

import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")

df = pd.read_csv(data)

df1 = df.groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df1

"""
              id term node  hours  price  vwap_ish  avg_id
node term                                                 
ben  mth  9   10  mth  B     50    111     32350    10.0
          10  11  mth  B    100    106     32350    10.0
          11  12  mth  B    150    108     32350    10.0
     qtr  3    4  qtr  B    300     89     95400     4.0
          4    5  qtr  B    300    113     95400     4.0
          5    6  qtr  B    300    116     95400     4.0
ota  mth  6    7  mth  A     50    110     32250     7.0
          7    8  mth  A    100    119     32250     7.0
          8    9  mth  A    150     99     32250     7.0
     qtr  0    1  qtr  A    300    107     90600     1.0
          1    2  qtr  A    300    104     90600     1.0
          2    3  qtr  A    300     91     90600     1.0
"""

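This doesn't really get me what I want, because a) it re-orders the rows and creates new index levels, and b) it calculates the aggregation for all rows.

I can get the subset easily enough with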


df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df2

"""
             id term node  hours  price  vwap_ish  avg_id
node term                                                
A    qtr  0   1  qtr    A    300    107     90600     1.0
          1   2  qtr    A    300    104     90600     1.0
          2   3  qtr    A    300     91     90600     1.0
B    qtr  3   4  qtr    B    300     89     95400     4.0
          4   5  qtr    B    300    113     95400     4.0
          5   6  qtr    B    300    116     95400     4.0
"""

But I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.

I've tried:

df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

And also a few variations of .merge and .join. For example:

df.merge(df2, how='left')

ValueError: 'term' is both an index level and a column label, which is ambiguous.

df.merge(df2, how='left', on=df.columns)

KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')

In writing this I realised I could take my first approach and then just do

df1.loc[df1.term != 'qtr', ['vwap_ish','avg_id']] = np.nan

but this seems quite hacky. It means I have to use a new column, rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break, say if term='mth', then that would be problematic too.

I'd really appreciate any help with this, as it's been a very steep learning curve to make the transition from data.table to Pandas, and there's so much I would do in a one-liner that is taking me hours to figure out.

Recommended answer

You can add the group_keys=False parameter to remove the MultiIndex, so that the left join works as expected:

df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
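
The ValueError in the question appears to come from df2 carrying the group keys both as index levels and as columns. As a rough sketch of another way around it, assuming the frame already has such an index, the index can be discarded with reset_index(drop=True) before merging. The small frames `left` and `right` and the `total` column below are hypothetical stand-ins, not the question's data:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "term": ["qtr", "qtr", "mth"]})

# Hypothetical stand-in for df2: 'term' is both a column and an index level.
right = left[left["term"] == "qtr"].assign(total=10)
right.index = pd.MultiIndex.from_arrays(
    [right["term"], right.index], names=["term", None]
)

try:
    left.merge(right, how="left")
    merge_failed = False
except ValueError:
    # Expected on pandas versions that flag the ambiguous 'term' label.
    merge_failed = True

# Dropping the index leaves only the columns as candidate join keys.
merged = left.merge(right.reset_index(drop=True), how="left")
print(merged)
```

Rows of `left` with no match in `right` (here the 'mth' row) end up with NaN in `total`, which mirrors the behaviour of the left join above.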

Solution without a left join:

m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
                                        .apply(lambda gp: gp.assign(
                                                     vwap_ish = (gp.hours * gp.price).sum(),
                                                     avg_id = np.mean(gp.id)
                                                      )
                                               ))

An improved solution that uses named aggregation and creates the vwap_ish column before the groupby, which can improve performance:

df2 = (df[df.term == 'qtr']
         .assign(vwap_ish = lambda x: x.hours * x.price)
         .groupby(['node','term'], as_index=False)
         .agg(vwap_ish=('vwap_ish','sum'),
              avg_id=('id','mean')))

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
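
Once the product of the two columns is computed row by row before grouping, each aggregation depends on a single column, so transform becomes usable after all. The following is a minimal sketch of that idea rather than part of the original answer. The helper column name `weighted` is an assumption made for illustration:

```python
import io
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)

# Work only on the filtered rows, and compute the row-wise product first so
# that each aggregation is a reduction over a single column.
# 'weighted' is a hypothetical helper column name.
sub = df[df["term"] == "qtr"].assign(weighted=lambda x: x["hours"] * x["price"])
grouped = sub.groupby(["node", "term"])

# transform returns one value per row of `sub` and keeps its original index.
# Column assignment aligns on the index, so rows missing from `sub` get NaN.
df["vwap_ish"] = grouped["weighted"].transform("sum")
df["avg_id"] = grouped["id"].transform("mean")
print(df)
```

This keeps the original row order and index, and the aggregation never runs on the 'mth' rows. To overwrite only the filtered rows of a column that already exists, the same Series could instead be assigned through df.loc with the boolean mask.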
