Python: How to get text from oddly structured HTML

Posted on September 16

I have a website with HTML structured like this:

<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>

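I need to extract the qualifications:

[
    'MBBS (University of Singapore, Singapore) 1978',
    'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
    'Dip Geriatric Med (NUS, Singapore) 2012',
    'GDPM (NUS, Singapore) 2015'
]

How can I do this with a CSS selector or XPath? I can extract all the text items in the parent div, but I cannot separate the qualifications from the other values (such as the type of first registration).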

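Recommended answer

Collect every stripped string in the div, then group the strings under the heading that precedes them:

def create_dict(strings, headers):
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (not including) the current heading.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # sublist[0] is the previous heading; the rest are its values.
            d[sublist[0]] = sublist[1:]
    # Strings that follow the last heading are not collected by this loop.
    return d

# Heading texts, in document order.
h = [e.get_text(strip=True) for e in soup.select('div h4')]
# Every non-empty text fragment inside the div, headings included.
s = list(soup.div.stripped_strings)
create_dict(s, h)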

Output:

{
    'Qualifications': [
        'MBBS (University of Singapore, Singapore) 1978',
        'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
        'Dip Geriatric Med (NUS, Singapore) 2012',
        'GDPM (NUS, Singapore) 2015'
    ],
    'Type of first registration / date': ['Full Registration (14/06/1979)'],
    'Type of current registration / date': ['Full Registration (14/06/1979)'],
    'Practising Certificate Start Date': ['01/01/2022']
}
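Note that 'Practising Certificate End Date' is missing from this output, because the loop only closes a group when it reaches the next heading. If the final section matters, a single pass that switches groups whenever it meets a heading avoids that. The following is a minimal sketch, not the original answer's code: the name `group_by_header` is hypothetical, and the `strings` list is typed out by hand to mirror what `soup.div.stripped_strings` yields for the HTML above. Like the original, it assumes no value is identical to a heading text.

```python
def group_by_header(strings, headers):
    """Group each string under the most recent heading seen before it."""
    header_set = set(headers)
    groups = {}
    current = None
    for s in strings:
        if s in header_set:
            # A heading starts a new, initially empty group.
            current = s
            groups[current] = []
        elif current is not None:
            # Any other string belongs to the current heading.
            groups[current].append(s)
    return groups


# Hand-written stand-ins for the values extracted from the page.
headers = [
    "Qualifications",
    "Type of first registration / date",
    "Type of current registration / date",
    "Practising Certificate Start Date",
    "Practising Certificate End Date",
]
strings = [
    "Qualifications",
    "MBBS (University of Singapore, Singapore) 1978",
    "MCFP (Family Med) (College of Family Physicians, Singapore) 1984",
    "Dip Geriatric Med (NUS, Singapore) 2012",
    "GDPM (NUS, Singapore) 2015",
    "Type of first registration / date",
    "Full Registration (14/06/1979)",
    "Type of current registration / date",
    "Full Registration (14/06/1979)",
    "Practising Certificate Start Date",
    "01/01/2022",
    "Practising Certificate End Date",
    "31/12/2023",
]

result = group_by_header(strings, headers)
print(result["Qualifications"])
print(result["Practising Certificate End Date"])
```

With real data, `strings` and `headers` would come from `soup.div.stripped_strings` and `soup.select('div h4')` as in the answer.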

Example

from bs4 import BeautifulSoup

html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Type of current registration / date</h4>
    Full Registration (14/06/1979)<br>
    <h4 class="ui-li-heading">Practising Certificate Start Date</h4>
    01/01/2022<br>
    <h4 class="ui-li-heading">Practising Certificate End Date</h4>
    31/12/2023<br>
    <p></p><br>
</div>
'''

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')

def create_dict(strings, headers):
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (not including) the current heading.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        if sublist:
            # sublist[0] is the previous heading; the rest are its values.
            d[sublist[0]] = sublist[1:]
    # Strings that follow the last heading are not collected by this loop.
    return d

h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)
create_dict(s, h)
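If only the qualifications are needed, another option is to locate the "Qualifications" heading and walk its following siblings until the next `h4`. This is a sketch of that alternative rather than the original answer's method. It uses an abbreviated copy of the HTML and assumes that the values are bare text nodes sitting directly between the headings, as they are in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Abbreviated copy of the HTML from the question.
html = '''
<div class="ui-rectframe">
    <p class="ui-li-desc"></p>
    <h4 class="ui-li-heading">Qualifications</h4>
    MBBS (University of Singapore, Singapore) 1978
    <br>
    MCFP (Family Med) (College of Family Physicians, Singapore) 1984
    <br>
    Dip Geriatric Med (NUS, Singapore) 2012
    <br>
    GDPM (NUS, Singapore) 2015
    <br>
    <h4 class="ui-li-heading">Type of first registration / date</h4>
    Full Registration (14/06/1979)<br>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Find the heading whose text is exactly "Qualifications".
heading = soup.find("h4", string="Qualifications")

qualifications = []
for node in heading.next_siblings:
    # Stop as soon as the next section heading starts.
    if getattr(node, "name", None) == "h4":
        break
    # Keep bare text nodes; skip <br> tags and whitespace-only strings.
    if isinstance(node, NavigableString):
        text = node.strip()
        if text:
            qualifications.append(text)

print(qualifications)
```

This approach does not depend on the other headings at all, so it keeps working if sections are added or reordered, as long as each section still starts with an `h4`.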

