You can keep a list of the tags you want, iterate over it, and use Beautiful Soup's append method to selectively include the matching elements in a new HTML structure.
from bs4 import BeautifulSoup

# Parse the source document
with open('Test.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')

# Empty skeleton that will receive the selected elements
new_html = BeautifulSoup("<html><body></body></html>", 'html.parser')

# A plain string matches by tag name; a dict maps a tag name to required attributes
tags_to_keep = ['title', {'p': {'class': 'm-b-0'}}, {'div': {'id': 'right-col'}}]

# Iterate through the tags to keep and append them to the new HTML
for tag in tags_to_keep:
    # If the tag is a string, find it in the original HTML
    if isinstance(tag, str):
        element = soup.find(tag)
    # If the tag is a dictionary, extract tag name and attributes,
    # then find them in the original HTML
    elif isinstance(tag, dict):
        tag_name = list(tag.keys())[0]
        tag_attrs = tag[tag_name]
        element = soup.find(tag_name, attrs=tag_attrs)
    else:
        continue
    # find() returns None when nothing matches; skip those instead of appending None
    if element is not None:
        new_html.body.append(element)

with open("output1.html", "w") as file:
    file.write(str(new_html))
Assuming you have an HTML document like the one below (which it would help to include in the question, for reproducibility):
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
<p>Paragraph outside the targeted tags.</p>
</body>
</html>
The resulting output1.html will contain the following (shown here with line breaks added for readability):
<html>
<body>
<title>Test Page</title>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
</body>
</html>
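If you want to check the behaviour without touching any files, the same idea can be sketched as a function that takes and returns strings. Note that `keep_only` is a hypothetical helper name made up for this illustration, not part of Beautiful Soup, and the sample markup is a shortened stand-in for the document above.

```python
from bs4 import BeautifulSoup


def keep_only(html, tags_to_keep):
    """Return a new HTML string holding only the first match for each spec.

    keep_only is a hypothetical helper, not a Beautiful Soup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    new_html = BeautifulSoup("<html><body></body></html>", "html.parser")
    for spec in tags_to_keep:
        if isinstance(spec, str):
            name, attrs = spec, {}
        else:
            # A dict spec maps one tag name to its required attributes
            name, attrs = next(iter(spec.items()))
        element = soup.find(name, attrs=attrs)
        if element is not None:
            # append() moves the element out of `soup` and into `new_html`
            new_html.body.append(element)
    return str(new_html)


sample = (
    "<html><head><title>Test Page</title></head><body>"
    '<p class="m-b-0">Kept paragraph.</p>'
    '<div id="right-col"><p>Kept div.</p></div>'
    "<p>Dropped paragraph.</p>"
    "</body></html>"
)
result = keep_only(
    sample,
    ["title", {"p": {"class": "m-b-0"}}, {"div": {"id": "right-col"}}],
)
print(result)
```

One thing to be aware of: appending an element that already lives in another tree moves it rather than copying it, so the original soup no longer contains the kept elements afterwards. If you need the source tree left intact, appending `copy.copy(element)` instead should do it.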