I have a website that contains an HTML structure like this:

<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>

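I need to extract the qualifications: [ 'MBBS (University of Singapore, Singapore) 1978', 'MCFP (Family Med) (College of Family Physicians, Singapore) 1984', 'Dip Geriatric Med (NUS, Singapore) 2012', 'GDPM (NUS, Singapore) 2015' ]. How can I do this with a CSS selector or XPath? I can extract all the text items inside the parent div, but I cannot separate the qualifications from the other values (such as the type of first registration).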

Recommended answer

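You could extract a list of the headers and a list of the div's stripped_strings, then use a function that splits the strings into sections by checking for the headers: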

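def create_dict(strings, headers):
    # Walk the strings once; a section is stored only when the next header is reached.
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # The first item is the previous header, the rest are its values.
            d[sublist[0]] = sublist[1:]
    return d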

h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)

create_dict(s,h)

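Output:

Note: this stores the result in a dict, so values from the other sections can be selected too if needed. The final section, Practising Certificate End Date, does not appear in it: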

{'Qualifications': ['MBBS (University of Singapore, Singapore) 1978',
  'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
  'Dip Geriatric Med (NUS, Singapore) 2012',
  'GDPM (NUS, Singapore) 2015'],
 'Type of first registration / date': ['Full Registration (14/06/1979)'],
 'Type of current registration / date': ['Full Registration (14/06/1979)'],
 'Practising Certificate Start Date': ['01/01/2022']}
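
The loop in create_dict closes a section only when it meets the next header, so nothing ever closes the last one. Below is a minimal sketch of a variant that keeps it. The name create_dict_with_tail is hypothetical, and the two lists are hard-coded stand-ins for what stripped_strings and the h4 texts yield for the HTML in the question. It also assumes no value is textually identical to a header.

```python
def create_dict_with_tail(strings, headers):
    """Group each string under the header that precedes it, last header included."""
    header_set = set(headers)
    sections = {}
    current = None  # header of the section currently being filled
    for text in strings:
        if text in header_set:
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


# Stand-ins for [e.get_text(strip=True) for e in soup.select('div h4')]
headers = [
    'Qualifications',
    'Type of first registration / date',
    'Type of current registration / date',
    'Practising Certificate Start Date',
    'Practising Certificate End Date',
]

# Stand-ins for list(soup.div.stripped_strings)
strings = [
    'Qualifications',
    'MBBS (University of Singapore, Singapore) 1978',
    'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
    'Dip Geriatric Med (NUS, Singapore) 2012',
    'GDPM (NUS, Singapore) 2015',
    'Type of first registration / date',
    'Full Registration (14/06/1979)',
    'Type of current registration / date',
    'Full Registration (14/06/1979)',
    'Practising Certificate Start Date',
    '01/01/2022',
    'Practising Certificate End Date',
    '31/12/2023',
]

result = create_dict_with_tail(strings, headers)
print(result['Qualifications'])
print(result['Practising Certificate End Date'])
```

Tracking the current header instead of an index also avoids an IndexError if a header is missing from the strings.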

Example

from bs4 import BeautifulSoup

html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>
'''
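# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')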

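def create_dict(strings, headers):
    # Walk the strings once; a section is stored only when the next header is reached.
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # The first item is the previous header, the rest are its values.
            d[sublist[0]] = sublist[1:]
    return d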

h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)

print(create_dict(s, h))
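
If only one section is needed, such as the qualifications, another option is to locate its h4 and walk the following siblings until the next h4. The sketch below assumes the headings and their values are direct siblings, as in the question's HTML. The helper name section_values is hypothetical, and the HTML is shortened to two sections.

```python
from bs4 import BeautifulSoup, Tag

html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def section_values(soup, heading):
    """Return the text lines between the given h4 heading and the next h4."""
    h4 = soup.find('h4', string=heading)
    if h4 is None:
        return []
    values = []
    for node in h4.next_siblings:
        if isinstance(node, Tag):
            if node.name == 'h4':
                break  # the next section starts here
            continue  # skip <br>, <p> and other tags
        text = node.strip()  # text nodes are str subclasses
        if text:
            values.append(text)
    return values


quals = section_values(soup, 'Qualifications')
print(quals)
```

Because the walk stops at the next h4 or at the end of the div, this also works for the last section.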
