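I have HTML text that I am trying to clean with BeautifulSoup. However, I need not only to identify which text segments sit inside span elements with class="highlight", but also to preserve the order in which the segments appear in the text.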

For example, here is sample code:

from bs4 import BeautifulSoup
import pandas as pd

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, 'html.parser')

Desired output (four text segments in this example):

data = {
    'text_order': [0, 1, 2, 3],
    'text': ["Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels",
             "Their large, ", "cheerful blooms", 
             "bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry."],
    'highlight': [True, False, True, False]
}

df = pd.DataFrame(data)
print(df)

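I tried extracting the span text with "highlight_spans = soup.find_all('span', class_='highlight')", but this does not preserve where each span appears relative to the rest of the paragraph text.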
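A note on that attempt: find_all does return the matching spans in document order; what gets lost is the unhighlighted text between them, so the position of each span within the paragraph cannot be recovered from the result alone. A minimal sketch on a shortened, made-up snippet (the html string and the variable names highlighted and all_segments are illustrative, not from the original code):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical HTML snippet used only for illustration.
html = (
    '<p><span class="highlight">First</span>. Middle, '
    '<span class="highlight">second</span> and the rest.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns only the matching spans, in document order.
highlight_spans = soup.find_all("span", class_="highlight")
highlighted = [span.get_text() for span in highlight_spans]

# Iterating over every text node keeps the unhighlighted text as well.
all_segments = [str(s) for s in soup.p.find_all(string=True)]

print(highlighted)
print(all_segments)
```

The second list contains every segment of the paragraph in order, which is the information the desired DataFrame needs.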

Recommended answer

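Try iterating over every text node inside the paragraph and checking whether each one has a parent with class "highlight":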

import pandas as pd
from bs4 import BeautifulSoup

original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""

# Parse the HTML content
soup = BeautifulSoup(original_string, "html.parser")

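data = []
# find_all(string=True) yields every text node inside <p> in document order
for i, text in enumerate(soup.p.find_all(string=True)):
    data.append(
        {
            "text_order": i,
            "text": text.strip(),
            # True if the text node is nested inside an element with class "highlight"
            "highlight": bool(text.find_parent(class_="highlight")),
        }
    )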

df = pd.DataFrame(data)
print(df)

Prints:

   text_order                                                                                                                                                                                                                                                                                                text  highlight
0           0                                                                                                                                                                                                                Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels       True
1           1                                                                                                                                                                                                                                                                                      . Their large,      False
2           2                                                                                                                                                                                                                                                                                     cheerful blooms       True
3           3  bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.      False
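Note that row 1 above reads ". Their large," while the desired output in the question has "Their large, ". If the punctuation left over after a highlighted span is unwanted, one possible post-processing step is sketched below. It assumes that leading punctuation in unhighlighted segments can always be dropped, which may not hold for other inputs; the shortened html string and the rows and is_highlight names are illustrative only.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened, hypothetical HTML used only for illustration.
html = (
    '<p><span class="highlight">Easy to cultivate</span>. '
    'Their large, <span class="highlight">cheerful blooms</span>'
    'bring a touch of summer.</p>'
)
soup = BeautifulSoup(html, "html.parser")

rows = []
for node in soup.p.find_all(string=True):
    is_highlight = node.find_parent(class_="highlight") is not None
    text = node.strip()
    if not is_highlight:
        # Assumption: punctuation left over right after a highlighted span
        # (". " here) is not wanted at the start of the next segment.
        text = text.lstrip(".,;:!? ")
    if text:  # skip segments that are empty after cleaning
        rows.append({"text": text, "highlight": is_highlight})

df = pd.DataFrame(rows)
# Number the segments only after empty ones have been dropped.
df.insert(0, "text_order", range(len(df)))
print(df)
```

Because strip() is still applied, the trailing space of "Their large, " is removed as well; drop the strip() call and trim only the left side if that trailing whitespace matters.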
