Identifying text with span elements using Python BeautifulSoup

Posted on December 1

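I have HTML text that I am trying to clean up with BeautifulSoup. However, I need to do more than identify which text segments sit inside span elements with class="highlight": I also need to preserve the order in which all segments appear in the text.

For example, here is some sample code: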

from bs4 import BeautifulSoup
import pandas as pd

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, 'html.parser')

Desired output (4 text segments in this example):

data = {
    'text_order': [0, 1, 2, 3],
    'text': ["Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels",
             "Their large, ", "cheerful blooms", 
             "bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry."],
    'highlight': [True, False, True, False]
}

df = pd.DataFrame(data)
print(df)

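I tried extracting the span text with "highlight_spans = soup.find_all('span', class_='highlight')", but that returns only the highlighted spans, so the position of each segment relative to the rest of the paragraph is lost.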

One approach is to iterate over every text node in the paragraph with find_all(string=True), which yields them in document order, and check whether each node has a parent with class "highlight":

import pandas as pd
from bs4 import BeautifulSoup

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, "html.parser")

data = []
# string=True matches every text node under <p>, in document order
for i, text in enumerate(soup.p.find_all(string=True)):
    data.append(
        {
            "text_order": i,
            "text": text.strip(),
            # True when the text node sits inside an element with class "highlight"
            "highlight": bool(text.find_parent(class_="highlight")),
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

   text_order  text  highlight
0  0  Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels  True
1  1  . Their large,  False
2  2  cheerful blooms  True
3  3  bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.  False
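Note that row 1 above comes out as ". Their large," rather than the "Their large, " asked for in the question, because the period after the first span belongs to the same text node. The following is a minimal sketch of one way to tidy that up, not part of the original answer. The helper name extract_segments, the shortened sample string, and the choice to strip leading ".,;:" characters are all assumptions made for illustration; stripping leading punctuation may be wrong for text that legitimately starts with such characters.

```python
import re
from bs4 import BeautifulSoup


def extract_segments(html, highlight_class="highlight"):
    """Hypothetical helper: list the text nodes of the first <p> in document order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.p.find_all(string=True):
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", node).strip()
        # Assumption: punctuation left over at the start of a segment
        # (e.g. the "." after a closing span) is not wanted.
        text = text.lstrip(".,;: ")
        if not text:
            continue  # skip nodes that held only whitespace or punctuation
        rows.append(
            {
                "text_order": len(rows),
                "text": text,
                "highlight": node.find_parent(class_=highlight_class) is not None,
            }
        )
    return rows


# Shortened stand-in for the HTML in the question.
sample = (
    '<p class="full-opaque">'
    '<span class="highlight">Easy to cultivate</span>. '
    'Their large, <span class="highlight">cheerful blooms</span>'
    " bring a touch of summer.</p>"
)
segments = extract_segments(sample)
for row in segments:
    print(row)
```

The returned list of dicts can be passed straight to pd.DataFrame(segments) if a DataFrame is needed. Unlike the desired output in the question, this sketch also strips trailing whitespace, so "Their large, " becomes "Their large,".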
