Python: extracting text from HTML, handling whitespace and <p> and <br> tags like a browser

Posted on May 16

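I am trying to extract text from an XHTML table as plain text, while preserving the line breaks that would appear if the document were rendered in an HTML renderer. I do not want to preserve the line breaks that exist only in the raw XML file.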

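The original table cells contain a lot of redundant whitespace that an HTML browser would not render, and they also contain <p> and <br> tags (which obviously are rendered).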

Here is an example of the kind of cell the source document contains:

<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>

The extracted text for this cell should look like this:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

Or like this (with an extra line between paragraphs):

INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

When I use BeautifulSoup's .get_text(separator=' ', strip=True) method, whitespace inside an XML text element that a browser would not render is retained in the output, like this:

INTERPOLATION QUEUE FULL In \n      Interpolated position motion mode the set-point buffer is full. The last \n      received set-point is not interpolated.

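When I use the more elaborate BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears, but unrendered line breaks remain, for example between "In" and "Interpolated".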

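When I use html2text with its default settings, the unrendered whitespace is stripped as I want, but the <p> and <br> tags present in the underlying HTML are ignored, and it inserts additional line breaks that do not exist in the HTML paragraphs.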

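A snippet of my html2text usage:

# Assumed imports: ET is taken to be the standard-library ElementTree
# (lxml.etree exposes a compatible tostring()).
import xml.etree.ElementTree as ET
import html2text

h2t = html2text.HTML2Text()
h2t.ignore_emphasis = True

def element2html(element):
    # Serialize the parsed element back to markup for html2text.
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    html = element2html(element)
    return h2t.handle(html).strip()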

Example output of the code above:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.

I can suppress the line-break insertion by configuring the html2text converter with body_width=0:

h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]

But it still discards the <p> and <br> layout information from the original HTML. Example output:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

How can I process the text so that whitespace is handled the way a browser handles it?

UPDATE: Here is another verbatim sample of XHTML from the source document. (This time I have not omitted the formatting attributes on the <td> tag.)

<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>

I want the extracted text to look like this (without line breaks):

Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].

Dropping the <b> tags from the output is perfectly fine, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also removes the whitespace around the <b> tags.

So the actual output looks like this:

Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].
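
The difference can be reproduced with the standard library alone. The sketch below is an illustration rather than the code from the question: it assumes xml.etree.ElementTree in place of BeautifulSoup, uses a shortened version of the paragraph, and mimics strip=True by stripping every text node before joining. Joining the text nodes first and normalizing whitespace afterwards keeps a single space next to the inline tags.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the paragraph from the question.
p = ET.fromstring(
    "<p><span>definition \n  of <b>CL[2]</b> and <b>CL[3].</b></span></p>"
)

# Stripping each text node before joining removes the spaces that
# separate the <b> elements from the surrounding words.
stripped_first = " ".join(
    "".join(s.strip() for s in p.itertext()).replace("\n", "").split()
)

# Joining the raw text nodes first, then collapsing every whitespace run
# (spaces and newlines alike) to one space, keeps the words apart.
joined_first = " ".join("".join(p.itertext()).split())

print(stripped_first)
print(joined_first)
```

The second expression never looks at individual nodes, so it does not matter where the inline tags fall relative to the spaces.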

from bs4 import BeautifulSoup

sample_html = """<td> INTERPOLATION QUEUE FULL In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.</td>"""

soup = BeautifulSoup(sample_html, 'html.parser').getText(strip=True, separator='\n')
print(soup)
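
For well-formed XHTML, a browser-like result can be approximated with the standard library only. The following is a minimal sketch under stated assumptions, not a full implementation of browser whitespace rules: the cell parses with xml.etree.ElementTree, tags carry no XML namespace, all visible text sits inside <p> elements, and explicit line breaks appear as <br/> elements. The names BREAK, block_lines and cell_text are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder for <br/>. A NUL character cannot occur in XML text,
# so it cannot collide with real content.
BREAK = "\x00"


def _collect(elem, parts):
    """Append the text inside elem to parts in document order."""
    if elem.tag == "br":
        parts.append(BREAK)
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        _collect(child, parts)
        if child.tail:
            parts.append(child.tail)


def block_lines(block):
    """Return the non-empty rendered lines of one block such as <p>."""
    parts = []
    _collect(block, parts)
    # Collapse each whitespace run (spaces, tabs, newlines) to one space,
    # separately for every piece between explicit line breaks.
    lines = [" ".join(piece.split()) for piece in "".join(parts).split(BREAK)]
    return [line for line in lines if line]


def cell_text(cell, paragraph_separator="\n"):
    """Extract a cell's text: one output line per <p> and per <br/>."""
    paragraphs = ["\n".join(block_lines(p)) for p in cell.iter("p")]
    return paragraph_separator.join(text for text in paragraphs if text)


sample = """<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""

print(cell_text(ET.fromstring(sample)))
```

Passing paragraph_separator="\n\n" gives the variant with a blank line between paragraphs. Empty lines are dropped in this sketch, so consecutive <br/> elements do not produce blank lines the way a browser would.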
