This is the dataset I'm using:个
country_or_area | year | comm_code | commodity | flow | trade_usd | weight_kg | quantity_name | quantity | category |
---|---|---|---|---|---|---|---|---|---|
Belgium | 2016 | 920510 | Brass-wind instru... | Export | 571297 | 3966.0 | Number of items | 4135.0 | 92_musical_instru... |
Guatemala | 2008 | 660200 | Walking-sticks, s... | Export | 35022 | 5575.0 | Number of items | 10089.0 | 66_umbrellas_walk... |
Barbados | 2006 | 220210 | Beverage waters, ... | Re-Export | 81058 | 44458.0 | Volume in litres | 24113.0 | 22_beverages_spir... |
Tunisia | 2016 | 780411 | Lead foil of a th... | Import | 4658 | 121.0 | Weight in kilograms | 121.0 | 78_lead_and_artic... |
Lithuania | 1996 | 560110 | Sanitary towels, ... | Export | 76499 | 5419.0 | Weight in kilograms | 5419.0 | 56_wadding_felt_n... |
This is the question I need to answer:个
2016年(按流量类型)商业化程度最高的商品(按出现次数计算)
I need to group by flow only, but I don't know how can I do it.个
query = '''
SELECT flow, commodity, MAX(quantity) quantity
FROM (
SELECT flow, commodity, COUNT(*) quantity
FROM transactions
WHERE year = 2016
GROUP BY flow, commodity
)
GROUP BY flow
'''
spark.sql(query).show(10)
The result I'm expecting is something like this:个
[('Export', ('Sweet biscuits, waffles and wafers', 24)),
('Import', ('Baking powders, prepared', 27)),
('Re-Export', ('Glues or adhesives, prepared nes, package > 1kg', 8)),
('Re-Import', ('Footwear,sole rubber/plastic,upper textile, not sport', 5))]