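How can I scrape the structure below to get only the h1, h2 & h3 elements above each <a> tag?

For every <a> tag, I want to get the heading that is placed above it, using Beautiful Soup.

HTML code: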

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>
    
    <hr>
</body>
</html>

My code:

from bs4 import BeautifulSoup
import requests

website = 'http://127.0.0.1:5500/test.html'
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'html.parser')
# print(soup.prettify())

href_tags = ["a"]
for tags in soup.find_all(href_tags):
    print(tags.name + ' -> ' + tags.text.strip())

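When I try the code above, it only prints the text of the <a> tags. I also want to get the <h1>, <h2> and <h3> headings that are placed above each <a> tag.

Recommended answer

Here is one way to get that information: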

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>
    
    <hr>
</body>
</html>
'''
big_list = []
soup = bs(html, 'html.parser')

for link in soup.select('a'):
    link_text = link.get_text(strip=True)
    link_url = link.get('href')
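    # find_all_previous() walks backwards through the document, so the first
    # h1/h2/h3 it yields is the heading closest above this link.
    # Note: [0] raises IndexError if no heading precedes the link.
    previous_header = [x.get_text(strip=True) for x in link.find_all_previous() if x.name in ['h1', 'h2', 'h3']][0]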
    big_list.append((link_text, link_url, previous_header))
df = pd.DataFrame(big_list, columns=['link_text', 'link_url', 'previous_header_text'])
print(df)

Result in the terminal:

   link_text  link_url  previous_header_text
0     Button         #            Heading H1
1     Button         #            Heading H2
2     Button         #            Heading H3
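
If pandas is not needed, the same lookup can probably be written more directly with `find_previous`, which accepts a list of tag names and returns the closest match above the starting element. The following is only a sketch: the HTML is a shortened copy of the page from the question, and `nearest_heading` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (an assumption, for brevity).
html = """
<body>
  <h1>Heading H1</h1>
  <p>Lorem Ipsum</p>
  <a href="#">Button</a>
  <hr>
  <h2>Heading H2</h2>
  <p><a href="#">Button</a></p>
  <hr>
  <h3>Heading H3</h3>
  <p><a href="#">Button</a></p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")


def nearest_heading(link):
    """Hypothetical helper: (tag name, text) of the closest h1/h2/h3 above link, or None."""
    heading = link.find_previous(["h1", "h2", "h3"])
    if heading is None:
        # No heading precedes this link, so there is nothing to report.
        return None
    return heading.name, heading.get_text(strip=True)


# Pair every link with the heading found above it.
pairs = [(nearest_heading(a), a.get_text(strip=True)) for a in soup.find_all("a")]

for heading, link_text in pairs:
    print(heading, "->", link_text)
```

Unlike indexing a list with `[0]`, this version returns `None` for a link that has no heading above it instead of raising an error.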

See the BeautifulSoup documentation for details.
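
A smaller change to the code in the question may also be enough if the goal is just to print headings and links in page order: `find_all` accepts a list of tag names and returns matches in document order. This is a sketch on a shortened copy of the question's HTML. It assumes that listing every heading is acceptable, including any heading that has no link below it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (an assumption, for brevity).
html = """
<body>
  <h1>Heading H1</h1>
  <p>Lorem Ipsum</p>
  <a href="#">Button</a>
  <hr>
  <h2>Heading H2</h2>
  <p><a href="#">Button</a></p>
  <hr>
  <h3>Heading H3</h3>
  <p><a href="#">Button</a></p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Same idea as href_tags in the question, extended with the heading tags.
wanted_tags = ["h1", "h2", "h3", "a"]

# find_all() yields the matching tags in the order they appear in the page.
lines = [f"{tag.name} -> {tag.get_text(strip=True)}" for tag in soup.find_all(wanted_tags)]

print("\n".join(lines))
```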
