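I have HTML text that I am trying to clean up with Beautiful Soup. However, I need not only to identify which text segments sit inside span elements with class='highlight', but also to preserve the order in which all segments appear in the text.
For example, here is some sample code: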
from bs4 import BeautifulSoup
import pandas as pd
original_string = """<div class="image-container half-saturation half-opaque" \
style="cursor: pointer;"><img src="../stim/microphone.png" style="width: 40px; height: 40px;">\
</div><p class="full-opaque">\
<span class="highlight">Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels</span>. \
Their large, <span class="highlight">cheerful blooms</span>\
bring a touch of summer to any outdoor space, creating a delightful atmosphere. \
Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, \
sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry.</p>"""
# Parse the HTML content
soup = BeautifulSoup(original_string, 'html.parser')
Desired output (four text segments in this example):
data = {
'text_order': [0, 1, 2, 3],
'text': ["Easy to cultivate, sunflowers are a popular choice for gardeners of all skill levels",
"Their large, ", "cheerful blooms",
"bring a touch of summer to any outdoor space, creating a delightful atmosphere. Whether you're enjoying their beauty in a garden or using them to add a splash of color to your living space, sunflowers are a symbol of positivity and radiance, making them a beloved part of nature's tapestry."],
'highlight': [True, False, True, False]
}
df = pd.DataFrame(data)
print(df)
I tried extracting the span text with "highlight_spans = soup.find_all('span', class_='highlight')", but this does not preserve the order in which the text appears in the paragraph.
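
For reference, one possible approach is to walk the direct children of the <p> element in document order instead of calling find_all, since the children include both the span tags and the plain text between them. The sketch below is only an illustration under a few assumptions: the text of interest lives in the first <p>, highlighted spans are direct children of that <p>, and the helper name split_segments is hypothetical (it is not part of Beautiful Soup).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_segments(html):
    """Return the text segments of the first <p> in document order.

    Hypothetical helper: each segment is a dict with its position,
    its text, and whether it came from a <span class="highlight">.
    """
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    if paragraph is None:
        return []

    rows = []
    # .children yields tags and bare strings in the order they appear
    for node in paragraph.children:
        if isinstance(node, Tag):
            text = node.get_text()
            is_highlight = (
                node.name == "span" and "highlight" in node.get("class", [])
            )
        else:
            # NavigableString: plain text between tags
            text = str(node)
            is_highlight = False

        # Skip segments that contain only whitespace
        if text.strip():
            rows.append(
                {"text_order": len(rows), "text": text, "highlight": is_highlight}
            )
    return rows
```

Calling split_segments(original_string) and passing the result to pd.DataFrame should give a table with the text_order, text, and highlight columns. Note that this sketch keeps the text exactly as it appears between tags, so the segment after the first span would still start with ". "; any trimming of leading punctuation or whitespace would have to be added separately.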