This is the dataset I'm using:

country_or_area | year | comm_code | commodity            | flow      | trade_usd | weight_kg | quantity_name       | quantity | category
Belgium         | 2016 | 920510    | Brass-wind instru... | Export    | 571297    | 3966.0    | Number of items     | 4135.0   | 92_musical_instru...
Guatemala       | 2008 | 660200    | Walking-sticks, s... | Export    | 35022     | 5575.0    | Number of items     | 10089.0  | 66_umbrellas_walk...
Barbados        | 2006 | 220210    | Beverage waters, ... | Re-Export | 81058     | 44458.0   | Volume in litres    | 24113.0  | 22_beverages_spir...
Tunisia         | 2016 | 780411    | Lead foil of a th... | Import    | 4658      | 121.0     | Weight in kilograms | 121.0    | 78_lead_and_artic...
Lithuania       | 1996 | 560110    | Sanitary towels, ... | Export    | 76499     | 5419.0    | Weight in kilograms | 5419.0   | 56_wadding_felt_n...

This is the question I need to answer:

The most traded commodity in 2016 for each flow type (measured by number of occurrences)

I need one row per flow, but I don't know how to do it.

query = '''
        SELECT flow, commodity, MAX(quantity) quantity
        FROM (
          SELECT flow, commodity, COUNT(*) quantity
          FROM transactions
          WHERE year = 2016
          GROUP BY flow, commodity
        )
        GROUP BY flow
        '''
spark.sql(query).show(10)

The result I'm expecting is something like this:

[('Export', ('Sweet biscuits, waffles and wafers', 24)),
 ('Import', ('Baking powders, prepared', 27)),
 ('Re-Export', ('Glues or adhesives, prepared nes, package > 1kg', 8)),
 ('Re-Import', ('Footwear,sole rubber/plastic,upper textile, not sport', 5))]

Recommended answer

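You can try ROW_NUMBER, partitioned by flow and ordered by COUNT(*) in descending order, so that the most frequent commodity in each flow gets rn = 1.

SELECT flow, commodity, cnt
FROM (
  SELECT flow, commodity,
         COUNT(*) AS cnt,
         -- rank commodities within each flow, highest count first
         ROW_NUMBER() OVER(PARTITION BY flow ORDER BY COUNT(*) DESC) rn
  FROM transactions
  WHERE year = 2016
  GROUP BY flow, commodity
)
WHERE rn = 1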

Or use MAX over each flow and keep the rows whose count equals that maximum.

SELECT flow, commodity, cnt
FROM (
  SELECT flow, commodity,
         COUNT(*) AS cnt,
         -- highest count among all commodities of the same flow
         MAX(COUNT(*)) OVER(PARTITION BY flow) maxcnt
  FROM transactions
  WHERE year = 2016
  GROUP BY flow, commodity
)
WHERE cnt = maxcnt

The former will probably perform better, though.
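
To see the partitioning in action without a Spark cluster, here is a minimal sketch that runs the ROW_NUMBER idea on Python's built-in sqlite3 module (this assumes SQLite 3.25 or later for window functions). The rows are made-up toy data, not values from the real dataset, and the counts are computed in a CTE first, which is a small restructuring of the query rather than the exact form above.

```python
import sqlite3

# Hypothetical toy rows (flow, commodity, year); not taken from the real dataset.
rows = [
    ("Export", "Biscuits", 2016),
    ("Export", "Biscuits", 2016),
    ("Export", "Lead foil", 2016),
    ("Import", "Baking powders", 2016),
    ("Import", "Baking powders", 2016),
    ("Import", "Baking powders", 2016),
    ("Import", "Glues", 2016),
    ("Import", "Glues", 2015),  # filtered out by the year condition
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (flow TEXT, commodity TEXT, year INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

top_per_flow = conn.execute("""
    WITH counts AS (
        -- one row per (flow, commodity) with its number of occurrences
        SELECT flow, commodity, COUNT(*) AS cnt
        FROM transactions
        WHERE year = 2016
        GROUP BY flow, commodity
    )
    SELECT flow, commodity, cnt
    FROM (
        -- number the commodities inside each flow, highest count first
        SELECT flow, commodity, cnt,
               ROW_NUMBER() OVER (PARTITION BY flow ORDER BY cnt DESC) AS rn
        FROM counts
    )
    WHERE rn = 1
    ORDER BY flow
""").fetchall()

print(top_per_flow)
# [('Export', 'Biscuits', 2), ('Import', 'Baking powders', 3)]
```

One difference between the two approaches is worth keeping in mind: when two commodities tie for the highest count within a flow, ROW_NUMBER keeps only one of them (which one is not guaranteed), while the MAX comparison returns every tied row.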
