import os
from bs4 import BeautifulSoup

# Get a list of all .htm files in the HTML_bak folder
html_files = [file for file in os.listdir('HTML_bak') if file.endswith('.htm')]

# Loop through each HTML file
for file_name in html_files:
    input_file_path = os.path.join('HTML_bak', file_name)
    output_file_path = os.path.join('HTML', file_name)

    # Read the input file with errors='ignore'
    with open(input_file_path, 'r', encoding='utf-8', errors='ignore') as input_file:
        input_content = input_file.read()

    # Parse the input content using BeautifulSoup with html5lib parser
    soup = BeautifulSoup(input_content, 'html5lib')
    main_content = soup.find('div', style='position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;')

    # Overwrite the output file with modified content
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        output_file.write(str(main_content))
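This code correctly scans the HTML files in the folder and extracts only the required div, matched by its style attribute. However, sometimes I also want to remove certain tags inside that div. These tags look like this: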
<div class="gmail_quote">2010/2/11 some text here .... </div>
How can I edit my code so that these tags with the gmail_quote class are removed as well?
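One possible direction, as a minimal sketch rather than a definitive answer: BeautifulSoup's `Tag.decompose()` deletes a tag together with everything inside it, so the matching divs can be dropped from the selected div before it is written out. The helper name `remove_gmail_quotes`, the shortened `style` value, and the inline sample string are illustrative assumptions, and the built-in `'html.parser'` is used here only to keep the sketch free of extra dependencies (the code above uses `'html5lib'`).

```python
from bs4 import BeautifulSoup


def remove_gmail_quotes(tag):
    """Delete every <div class="gmail_quote"> inside tag, in place.

    Hypothetical helper, not part of the original script.
    """
    if tag is None:
        return tag
    # Look up one match at a time so that a quote nested inside another
    # quote is never touched after its parent has already been destroyed.
    quote = tag.find('div', class_='gmail_quote')
    while quote is not None:
        quote.decompose()
        quote = tag.find('div', class_='gmail_quote')
    return tag


# Shortened stand-in for the real document (assumption for illustration).
sample = (
    '<div style="width:98%;">'
    '<div> lorem ipsum </div>'
    '<div class="gmail_quote">2010/2/11 some text here</div>'
    '</div>'
)
soup = BeautifulSoup(sample, 'html.parser')
main_content = soup.find('div', style='width:98%;')
cleaned = str(remove_gmail_quotes(main_content))
```

In the script above, the equivalent step would sit between the `soup.find(...)` call and the `output_file.write(...)` call.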
Update 8/29/23:
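I am pasting a sample of the HTML to make sure my question is clear. I want to keep the content of the <div style="position:initial.... that comes after <body bgColor=#ffffff>, and remove the content of the <div class="gmail_quote">2010/2/11 ... element.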
<html><body style="background-color:#FFFFFF;"><div></div></body></html><article style="width:100%;float:left; position:left;background-color:#FFFFFF; margin: 0mm 0mm 0mm 0mm; "><style>
@media print {
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
}pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
@page {size: auto; margin: 12mm 4mm 12mm 6mm; }
</style>
<div style="position:initial;float:left;background-color:transparent;text-align:left;width:100%;margin-left:5px;">
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
.hdrfldname{color:black;font-size:20px; line-height:120%;}
.hdrfldtext{overflow-wrap:break-word;color:black;font-size:20px;line-height:120%;}
</style></head>
<body bgColor=#ffffff>
<div style="position:initial;float:left;text-align:left;font-weight:normal;width:100%;background-color:#eee9e9;">
<span class='hdrfldname'>SUBJECT: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>FROM: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>TO: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>DATE: </span><span class='hdrfldtext'>2010/02/12 09:10</span><br>
</div></body></html>
</div>
<div style="position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;"><br>
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
</style></head><body bgColor=#ffffff>
<div> lorem ipsum </div>
<div class="gmail_quote">2010/2/11 lorem ipsum<span dir="ltr"><<a style="max-width:100%;" href="lorem ipsum">lorem ipsum</a>></span><br>
</body></html>
</div>
</article>
<div> <br></div>
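For files shaped like the sample above, the whole folder pass could be sketched as follows. This is a hedged illustration, not a confirmed solution: the function name `clean_folder`, its parameters, and the temporary-folder demo are assumptions made for the example. The parser is a parameter that defaults to the built-in `'html.parser'`; passing `'html5lib'` would match the original script. Two things are worth noting. First, when the styled div is missing, `soup.find` returns `None`, and writing `str(None)` would put the literal text "None" into the output file, so the sketch skips such files. Second, the `gmail_quote` div in the sample is never closed, so the parser decides where it ends, and everything the parser places inside it is removed along with it.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Same style string the original script searches for.
MAIN_STYLE = ('position:initial;float:left;text-align:left;'
              'overflow-wrap:break-word !important;width:98%;margin-left:5px;'
              'background-color:#FFFFFF;color:black;')


def clean_folder(src_dir, dst_dir, parser='html.parser'):
    """Write the main div of each .htm file to dst_dir, minus gmail_quote divs.

    Hypothetical helper; returns the names of the files that were written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for file_name in sorted(os.listdir(src_dir)):
        if not file_name.endswith('.htm'):
            continue
        src_path = os.path.join(src_dir, file_name)
        with open(src_path, 'r', encoding='utf-8', errors='ignore') as f:
            soup = BeautifulSoup(f.read(), parser)
        main_content = soup.find('div', style=MAIN_STYLE)
        if main_content is None:
            continue  # no main div: skip instead of writing "None"
        # Remove quoted replies one at a time, including nested ones.
        quote = main_content.find('div', class_='gmail_quote')
        while quote is not None:
            quote.decompose()
            quote = main_content.find('div', class_='gmail_quote')
        with open(os.path.join(dst_dir, file_name), 'w', encoding='utf-8') as f:
            f.write(str(main_content))
        written.append(file_name)
    return written


# Demo on throwaway folders (assumption: simplified, well-formed input).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'HTML_bak')
    dst = os.path.join(tmp, 'HTML')
    os.makedirs(src)
    page = ('<div style="%s"><div> lorem ipsum </div>'
            '<div class="gmail_quote">2010/2/11 quoted reply</div></div>'
            % MAIN_STYLE)
    with open(os.path.join(src, 'sample.htm'), 'w', encoding='utf-8') as f:
        f.write(page)
    with open(os.path.join(src, 'other.htm'), 'w', encoding='utf-8') as f:
        f.write('<div>no main div here</div>')
    written = clean_folder(src, dst)
    with open(os.path.join(dst, 'sample.htm'), encoding='utf-8') as f:
        result = f.read()
```

With the real folders, the call would be `clean_folder('HTML_bak', 'HTML', parser='html5lib')`, assuming html5lib is installed.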