I have two tables:

Employee table:
+----+-------+--------+--------------+
| id | name  | salary | departmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |
+----+-------+--------+--------------+
Department table:
+----+-------+
| id | name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+
Required output:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
| IT         | Max      | 90000  |
+------------+----------+--------+

The task is to find the employees who earn the highest salary in each department.

My Python code looks like this:

import pandas as pd

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, how = "inner", left_on ="departmentId", right_on = "id")
    df["rank"] = employee.groupby("departmentId")["salary"].rank(method = "dense", ascending=False)
    return( df.query("rank == 1")[["name_y","name_x","salary"]].rename(columns = {"name_y":"Department", "name_x":"Employee", "salary":"Salary"}))

This gives the wrong output, because groupby.rank returns incorrect ranks:

   id_x name_x  salary  departmentId  id_y name_y  rank
0     1    Joe   70000             1     1     IT   2.0
1     2    Jim   90000             1     1     IT   1.0
2     5    Max   90000             1     1     IT   1.0
3     3  Henry   80000             2     2  Sales   2.0
4     4    Sam   60000             2     2  Sales   1.0

Why is Henry's rank 2? Shouldn't it be 1? What exactly am I doing wrong?

Recommended answer

The only changes you need to make are:

  • Change employee.groupby to df.groupby.
  • If you are using pd < 2.2.x, use how="left" instead of how="inner" to keep the row order of employee.

def department_highest_salary(employee: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    df = employee.merge(department, how = "left", left_on ="departmentId", right_on = "id")
    df["rank"] = df.groupby("departmentId")["salary"].rank(method = "dense", ascending=False)
    return( df.query("rank == 1")[["name_y","name_x","salary"]].rename(columns = {"name_y":"Department", "name_x":"Employee", "salary":"Salary"}))

department_highest_salary(employee, department)

  Department Employee  Salary
1         IT      Jim   90000
2      Sales    Henry   80000
4         IT      Max   90000

With how="inner" (the choice between the two has no other material effect in your example), the result for pd < 2.2.x would be:

  Department Employee  Salary
1         IT      Jim   90000
2         IT      Max   90000
3      Sales    Henry   80000

Problem

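You are running the groupby on employee rather than on the merged df. This is a problem because, in pd < 2.2.x, df.merge(...) with how="inner" does not preserve the order of the left keys.

Compare: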

employee[['id', 'name']]

   id   name
0   1    Joe
1   2    Jim
2   3  Henry
3   4    Sam
4   5    Max

df[['id_x', 'name_x']] # the order of `id` values has changed

   id_x name_x
0     1    Joe
1     2    Jim
2     5    Max
3     3  Henry
4     4    Sam

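That is, employee.groupby(...) does return the correct values, but df now has the names in different rows. Because the assignment to df["rank"] aligns on index values rather than on position, the ranks land on the wrong rows.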
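To see the index alignment in isolation, here is a small sketch that reproduces the mismatch on any pandas version. It assumes the row order shown above and builds it by hand with iloc instead of relying on merge; the column names rank_from_employee and rank_from_df are made up for illustration.

```python
import pandas as pd

employee = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Joe", "Jim", "Henry", "Sam", "Max"],
    "salary": [70000, 90000, 80000, 60000, 90000],
    "departmentId": [1, 1, 2, 2, 1],
})

# Assumption: mimic the row order an inner merge produced before pandas 2.2
# (Joe, Jim, Max, Henry, Sam), with a fresh 0..4 index like merge returns.
df = employee.iloc[[0, 1, 4, 2, 3]].reset_index(drop=True)

# Ranks computed on `employee` carry employee's index labels.
rank_on_employee = employee.groupby("departmentId")["salary"].rank(
    method="dense", ascending=False
)

# Column assignment aligns on index labels, not on names or positions,
# so label 3 (Sam's rank in `employee`) lands on row 3 of `df` (Henry).
df["rank_from_employee"] = rank_on_employee

# Ranking on `df` itself keeps values and rows on the same index.
df["rank_from_df"] = df.groupby("departmentId")["salary"].rank(
    method="dense", ascending=False
)

print(df[["name", "rank_from_employee", "rank_from_df"]])
```

Henry ends up with rank 2.0 in the first column and 1.0 in the second, which is exactly the discrepancy in the question.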

This was actually a bug: your merge should have preserved the order of employee, so your code should have worked, even with employee.groupby. According to the release notes, this was fixed in pd 2.2.x.
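If you would rather not depend on the row order of merge at all, one alternative (a sketch, not part of the fix above) is to rank on employee first and merge only the top rows, so each rank stays attached to its own row. The function name department_highest_salary_prerank and the suffixes _emp and _dept are my own choices for illustration.

```python
import pandas as pd

def department_highest_salary_prerank(
    employee: pd.DataFrame, department: pd.DataFrame
) -> pd.DataFrame:
    # Rank on a copy of employee, so rank and row share one index.
    ranked = employee.copy()
    ranked["rank"] = ranked.groupby("departmentId")["salary"].rank(
        method="dense", ascending=False
    )
    top = ranked[ranked["rank"] == 1]

    # Merge afterwards; whatever row order merge returns, ranks are already fixed.
    merged = top.merge(
        department,
        how="inner",
        left_on="departmentId",
        right_on="id",
        suffixes=("_emp", "_dept"),
    )
    return merged[["name_dept", "name_emp", "salary"]].rename(
        columns={"name_dept": "Department", "name_emp": "Employee", "salary": "Salary"}
    )

employee = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Joe", "Jim", "Henry", "Sam", "Max"],
    "salary": [70000, 90000, 80000, 60000, 90000],
    "departmentId": [1, 1, 2, 2, 1],
})
department = pd.DataFrame({"id": [1, 2], "name": ["IT", "Sales"]})

result = department_highest_salary_prerank(employee, department)
print(result)
```

The row order of the result may still differ between pandas versions, but the set of rows returned does not.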
