Python: Using BeautifulSoup to generate an HTML page containing only specific tags from another page

Posted on December 10

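I am exploring BeautifulSoup, and my goal is to create a new HTML file that keeps only specific tags from an existing HTML file.

I can achieve this successfully with the program below. However, I suspect there is a more appropriate, more natural way to do it that does not require appending strings by hand.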

from bs4 import BeautifulSoup
#soup = BeautifulSoup(page.content, 'html.parser')

with open('P:/Test.html', 'r') as f:
    contents = f.read()
    soup= BeautifulSoup(contents, 'html.parser')

NewHTML = "<html><body>"
NewHTML+="\n"+str(soup.find('title'))
NewHTML+="\n"+str(soup.find('p', attrs={'class': 'm-b-0'}))
NewHTML+="\n"+str(soup.find('div', attrs={'id' :'right-col'}))
NewHTML+= "</body></html>"

with open("output1.html", "w") as file:
    file.write(NewHTML)

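Recommended answer

You can keep a list of the tags you want, iterate over it, and use Beautiful Soup's append method to selectively include the matching elements in the new HTML structure.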

from bs4 import BeautifulSoup

with open('Test.html', 'r') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'html.parser')

new_html = BeautifulSoup("<html><body></body></html>", 'html.parser')

tags_to_keep = ['title', {'p': {'class': 'm-b-0'}}, {'div': {'id': 'right-col'}}]

# Iterate through the tags to keep and append them to the new HTML
for tag in tags_to_keep:
    if isinstance(tag, str):
        # A plain string is a tag name: look it up directly
        found = soup.find(tag)
    elif isinstance(tag, dict):
        # A dictionary maps one tag name to its attribute filter
        tag_name, tag_attrs = next(iter(tag.items()))
        found = soup.find(tag_name, attrs=tag_attrs)
    else:
        found = None
    # find() returns None when nothing matches; skip those entries.
    # Note that append() moves the tag out of the original soup.
    if found is not None:
        new_html.body.append(found)

with open("output1.html", "w") as file:
    file.write(str(new_html))
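
As a variation on the loop above, here is a minimal sketch that describes each wanted element with a CSS selector and looks it up with select_one, which can be shorter than mixing strings and dictionaries. It skips selectors that match nothing and copies each match with copy.copy so the source tree is left unmodified. The function name build_filtered_html, the inline sample markup, and the selector list are all hypothetical and only serve the illustration.

```python
import copy

from bs4 import BeautifulSoup


def build_filtered_html(source_html, selectors):
    """Return a new soup whose body holds the first match of each selector."""
    soup = BeautifulSoup(source_html, "html.parser")
    new_html = BeautifulSoup("<html><body></body></html>", "html.parser")
    for selector in selectors:
        match = soup.select_one(selector)
        if match is None:
            # Nothing matched this selector; skip it rather than append None.
            continue
        # Copy the tag so it is not moved out of the source soup.
        new_html.body.append(copy.copy(match))
    return new_html


# Hypothetical sample input, modeled on the tags used in the question.
sample = """<html><head><title>Test Page</title></head><body>
<p class="m-b-0">Keep me.</p>
<div id="right-col"><p>Keep me too.</p></div>
<p>Drop me.</p>
</body></html>"""

result = build_filtered_html(
    sample, ["title", "p.m-b-0", "div#right-col", "span.missing"]
)
print(result)
```

The selector "span.missing" is included only to show that an unmatched entry is ignored instead of raising an error.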

Suppose you have an HTML document like the one below (including it in the question would have helped with reproducibility):

<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <p class="m-b-0">Paragraph with class 'm-b-0'.</p>
    <div id="right-col">
        <p>Paragraph inside the 'right-col' div.</p>
    </div>
    <p>Paragraph outside the targeted tags.</p>
</body>
</html>

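The resulting output1.html will contain the following (indented here for readability; str(new_html) does not pretty-print the markup itself):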

<html>
   <body>
      <title>Test Page</title>
      <p class="m-b-0">Paragraph with class 'm-b-0'.</p>
      <div id="right-col">
         <p>Paragraph inside the 'right-col' div.</p>
      </div>
   </body>
</html>
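
Note that this output places the title element inside body, whereas HTML expects title to live in head. A hedged variant, assuming a skeleton that has both head and body, could route the title separately and call prettify() to get indented output. The inline sample and the list of (name, attrs) pairs below are assumptions made for the sketch, not part of the original answer.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical sample input mirroring the document shown earlier.
sample = """<!DOCTYPE html>
<html>
<head><title>Test Page</title></head>
<body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col"><p>Paragraph inside the 'right-col' div.</p></div>
<p>Paragraph outside the targeted tags.</p>
</body>
</html>"""

soup = BeautifulSoup(sample, "html.parser")
new_html = BeautifulSoup(
    "<html><head></head><body></body></html>", "html.parser"
)

# Put the title in head, where HTML expects it.
title = soup.find("title")
if title is not None:
    new_html.head.append(copy.copy(title))

# Everything else goes into body; unmatched entries are skipped.
for name, attrs in [("p", {"class": "m-b-0"}), ("div", {"id": "right-col"})]:
    match = soup.find(name, attrs=attrs)
    if match is not None:
        new_html.body.append(copy.copy(match))

# prettify() returns an indented string, one element per line.
output = new_html.prettify()
print(output)
```

Keep in mind that prettify() inserts whitespace around text nodes, so it is best suited to output meant for reading rather than byte-exact markup.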