Multiple tags with Python BeautifulSoup

Posted on August 29

import os
from bs4 import BeautifulSoup

# Get a list of all .htm files in the HTML_bak folder
html_files = [file for file in os.listdir('HTML_bak') if file.endswith('.htm')]

# Loop through each HTML file
for file_name in html_files:
    input_file_path = os.path.join('HTML_bak', file_name)
    output_file_path = os.path.join('HTML', file_name)
    
    # Read the input file with errors='ignore'
    with open(input_file_path, 'r', encoding='utf-8', errors='ignore') as input_file:
        input_content = input_file.read()

    # Parse the input content using BeautifulSoup with html5lib parser
    soup = BeautifulSoup(input_content, 'html5lib')

    main_content = soup.find('div', style='position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;')
  
    # Overwrite the output file with modified content
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        output_file.write(str(main_content))

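This code correctly scans the HTML files in the folder and pulls in only the desired div, matched by its style attribute. However, sometimes I also want to remove certain tags inside that div. Those tags look like this: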

<div class="gmail_quote">2010/2/11 some text here .... </div>

How can I edit my code so that it also removes these tags with the gmail_quote class?

Update 8/29/23:

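I am pasting a sample of the HTML content to make sure my question is clear. I want to keep the content of <div style="position:initial.... that comes after <body bgColor=#ffffff>, and remove the content of <div class="gmail_quote">2010/2/11 ...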



<html><body style="background-color:#FFFFFF;"><div></div></body></html><article style="width:100%;float:left; position:left;background-color:#FFFFFF; margin: 0mm 0mm 0mm 0mm; "><style>
@media print {
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
}pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
@page {size: auto; margin: 12mm 4mm 12mm 6mm; }
</style>
<div style="position:initial;float:left;background-color:transparent;text-align:left;width:100%;margin-left:5px;">
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
.hdrfldname{color:black;font-size:20px; line-height:120%;}
.hdrfldtext{overflow-wrap:break-word;color:black;font-size:20px;line-height:120%;}
</style></head>
<body bgColor=#ffffff>
<div style="position:initial;float:left;text-align:left;font-weight:normal;width:100%;background-color:#eee9e9;">
<span class='hdrfldname'>SUBJECT: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>FROM: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>TO: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>DATE: </span><span class='hdrfldtext'>2010/02/12 09:10</span><br>
</div></body></html>
</div>
<div style="position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;"><br>
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
</style></head><body bgColor=#ffffff>
<div> lorem ipsum </div>

<div class="gmail_quote">2010/2/11 lorem ipsum<span dir="ltr">&lt;<a  style="max-width:100%;" href="lorem ipsum">lorem ipsum</a>&gt;</span><br>

</body></html>
</div>
</article>
<div>&nbsp;<br></div>

Recommended answer
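
A minimal sketch of one possible approach: after locating the main div, search inside it for div tags with the gmail_quote class and remove each match with decompose(), which deletes the tag together with everything nested in it. The helper name extract_main_without_quotes, the shortened MAIN_STYLE string, and the sample markup are made up for illustration. The sketch uses the built-in html.parser so that it runs without extra packages, whereas the code in the question uses html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the long style string in the question.
MAIN_STYLE = "width:98%;margin-left:5px;"


def extract_main_without_quotes(html, parser="html.parser"):
    """Return the main div as a string, with gmail_quote blocks removed."""
    soup = BeautifulSoup(html, parser)

    # Locate the wanted div by an exact match on its style attribute.
    main_content = soup.find("div", style=MAIN_STYLE)
    if main_content is None:
        # No matching div: return an empty string rather than "None".
        return ""

    # Remove every quoted-reply block nested inside the main div.
    for quote in main_content.find_all("div", class_="gmail_quote"):
        quote.decompose()

    return str(main_content)


# Illustrative sample markup, much smaller than the real files.
sample = (
    '<div style="width:98%;margin-left:5px;">'
    "<div> lorem ipsum </div>"
    '<div class="gmail_quote">2010/2/11 quoted reply</div>'
    "</div>"
)
result = extract_main_without_quotes(sample)
print(result)
```

In the loop from the question, the find_all / decompose() step would go right after main_content is assigned and before the output file is written. The check for None is an addition here: without it, a file that has no matching div would be written out as the literal text "None".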
