I have an HTML file with the following structure. As you can see, the headers and the matches under each header are not grouped into their own divs.

<div class="basketball">
    
    <div class="header">
        <span class="event_title">Playoff</span>
    </div>
    <div class="match">
        <div class="home">Bakken</div>
        <div class="away">Akken</div>
        <div class="home score">90</div>
        <div class="away score">70</div>
    </div>
    <div class="match">
        <div class="home">Monaco</div>
        <div class="away">Strasbourg</div>
        <div class="home score">80</div>
        <div class="away score">65</div>
    </div>
    
    <div class="header">
        <span class="event_title">Semi Finals</span>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    
    <div class="header">
        <span class="event_title">Quarter Finals</span>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    
    <div class="header">
        <span class="event_title">Normal Season Matches</span>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
    
</div>

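I would like to group each header and the matches under it into a separate div, but doing that by hand is not practical because the original file has more than 1000 lines of HTML markup.

I need to extract the data so that the output looks like this: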

data = {Playoff: [["Bakken", "Akken", 90, 70], ["Monaco", "Strasbourg", 80, 65]], 
        Semi Finals: [["Randers", "Celtics", 60, 90], [...]]
        Quarter Finals: [.... ],
        Normal Season Matches: [.... ]}

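The first part I did is:

from bs4 import BeautifulSoup

# html_text is assumed to hold the HTML shown above
soup = BeautifulSoup(html_text, "html.parser")

data = {}

# Create one empty list per header title
for i in soup.find_all("div", class_="header"):
    title = i.find("span", class_="event_title").get_text()
    data[title] = []

data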

# output
{'Playoff': [],
 'Semi Finals': [],
 'Quarter Finals': [],
 'Normal Season Matches': []}

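I can't figure out how to fill the lists with the correct matches. Any help would be appreciated.

Recommended answer

If html_text contains the HTML from the question, you can do this: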

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")

out = {}
for m in soup.select("div.match"):
    data = [div.text for div in m.select("div")]
    header = m.find_previous(class_="header").text.strip()
    out.setdefault(header, []).append(data)

print(out)

Prints:

{
    "Playoff": [["Bakken", "Akken", "90", "70"], ["Monaco", "Strasbourg", "80", "65"]],
    "Semi Finals": [
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
    ],
    "Quarter Finals": [
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
    ],
    "Normal Season Matches": [
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
        ["Randers", "Celtics", "60", "90"],
    ],
}
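
Note that the scores above come back as strings, while the desired output in the question shows them as integers. A minimal sketch of one way to convert them follows. It relies on the same idea as the answer (find_previous walks backwards in document order, so each match is assigned to the nearest header before it). The function name parse_matches and the shortened sample HTML are assumptions made for illustration, and the sketch assumes every cell with the "score" class contains a valid integer.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same structure as the HTML in the question.
html_text = """
<div class="basketball">
    <div class="header"><span class="event_title">Playoff</span></div>
    <div class="match">
        <div class="home">Bakken</div>
        <div class="away">Akken</div>
        <div class="home score">90</div>
        <div class="away score">70</div>
    </div>
    <div class="header"><span class="event_title">Semi Finals</span></div>
    <div class="match">
        <div class="home">Randers</div>
        <div class="away">Celtics</div>
        <div class="home score">60</div>
        <div class="away score">90</div>
    </div>
</div>
"""


def parse_matches(html):
    """Group match rows under the nearest preceding header (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    out = {}
    for match in soup.select("div.match"):
        row = []
        for div in match.select("div"):
            text = div.get_text(strip=True)
            # Cells carrying the "score" class are assumed to hold integers.
            is_score = "score" in div.get("class", [])
            row.append(int(text) if is_score else text)
        # The closest header before this match in document order owns it.
        header = match.find_previous("div", class_="header").get_text(strip=True)
        out.setdefault(header, []).append(row)
    return out


data = parse_matches(html_text)
print(data)
```

If a score cell can be empty or non-numeric in the real file, the int() call would raise ValueError, so a guard such as text.isdigit() may be needed there.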
