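I am trying to extract the text from an XHTML table as plain text, keeping the line breaks that would appear if the document were rendered in an HTML renderer. I do not want to keep the line breaks that exist only in the raw XML file.

The raw table cells contain a lot of extra whitespace that an HTML browser would not render, along with <p></p> and <br /> tags (which obviously are rendered).

Here is an example of the kind of cell the source document contains: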

<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>

The extracted text for this cell should look like this:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

Or like this (with an extra blank line between paragraphs):

INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

When I use BeautifulSoup's .get_text(separator=' ', strip=True) method, whitespace within a text element of the XML that a browser would not render is preserved in the output, like this:

INTERPOLATION QUEUE FULL In \n      Interpolated position motion mode the set-point buffer is full. The last \n      received set-point is not interpolated.

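When I use the more sophisticated BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears, but the unrendered line breaks remain, for example between "In" and "Interpolated".

When I use html2text with its default settings, the unrendered whitespace is stripped as I want, but the <p> and <br /> tags present in the underlying HTML are ignored, and it inserts additional line breaks that do not exist in the HTML paragraphs.

A snippet of my html2text usage: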

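import xml.etree.ElementTree as ET  # assumed; lxml.etree accepts the same tostring() call

import html2text

h2t = html2text.HTML2Text()
h2t.ignore_emphasis = True

def element2html(element):
    # Serialize the element back to markup so html2text can parse it
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    html = element2html(element)
    return h2t.handle(html).strip()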

Example output from the code above:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.

I can suppress the inserted line breaks by configuring the html2text converter with body_width=0:

h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]

But it still discards the <p> and <br /> layout information from the original HTML. Example output:

INTERPOLATION QUEUE FULL  In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.

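How can I process the text so that whitespace is handled the way a browser handles it?

UPDATE: Here is another verbatim example of XHTML from the source document. (This time I have not omitted the formatting attributes on the <td> tag.)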

<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor 
  stuck - the motor is powered but is not moving according to the definition 
  of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>

I want the extracted text to look like this (no line breaks):

Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].

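Dropping the <b> tags from the output is no problem at all, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also removes the spaces around the <b> tags.

So the actual output looks like this: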

Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].

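Recommended answer

Since I'm not sure whether the given HTML sample is really wrapped the way the <p> elements in the question are, my answer is an educated guess, but you could try a simple approach like this: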

from bs4 import BeautifulSoup

sample_html = """<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""

soup = BeautifulSoup(sample_html, 'html.parser').getText(strip=True, separator='\n')
print(soup)

This should print:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
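
One caveat with strip=True: BeautifulSoup strips each text node separately before joining them, so whitespace next to inline tags such as <b> is lost (and with separator='\n' every inline tag would start a new line). The plain-Python sketch below imitates that effect on the text nodes of the <b> cell from the question's update. The fragments list is written out by hand for illustration and is not produced by a parser.

```python
# Text nodes of the <b> cell from the question's update, written out by hand
# (an assumption for illustration; a parser would yield these pieces).
fragments = [
    "Motor \n  stuck - the motor is powered but is not moving according to the definition \n  of ",
    "CL[2]",
    " and ",
    "CL[3].",
]

# Stripping each node before joining loses the spaces next to the inline tags.
stripped_first = "".join(part.strip() for part in fragments)

# Joining first and then collapsing whitespace runs keeps one space between words.
joined_first = " ".join("".join(fragments).split())

print(stripped_first)
print(joined_first)
```

The second form is the one to aim for: join the raw text of a paragraph first, then collapse the whitespace.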

However, if the sample really is spaced the way it is shown, then, IMHO, you don't need any fancy modules.

For example, this:

from bs4 import BeautifulSoup

sample_html = """<td>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL 
  </span><span style="FONT-SIZE: 11pt"></span></p>
  <p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In 
  Interpolated position motion mode the set-point buffer is full. The last 
  received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>"""

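soup = BeautifulSoup(sample_html, 'html.parser').find_all("p")
# split() already treats newlines as whitespace, so no separate replace("\n", "") is needed
# (it could glue two words together when a source line has no trailing space).
text = [" ".join(p.getText().split()) for p in soup]
print("\n".join(text))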

gives the following:

INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
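
If the cells can also contain <br /> tags or inline tags such as <b>, the same idea can be sketched with only the standard library. This is a rough approximation under stated assumptions, not a full model of browser rendering: the class name CellText and the helper cell_text are made up for this example, only <p>, <br> and <td> are treated as line boundaries, and CSS is ignored entirely.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect text line by line, treating <p> and <br> as line boundaries."""

    def __init__(self):
        super().__init__()
        self.lines = []  # finished output lines
        self.parts = []  # raw text pieces of the line being built

    def _flush(self):
        # Collapse every whitespace run (including source newlines) to one space.
        # Note: str.split() also collapses non-breaking spaces, unlike a browser.
        line = " ".join("".join(self.parts).split())
        if line:
            self.lines.append(line)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br /> arrives here too: handle_startendtag delegates to this method.
        if tag in ("p", "br"):
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("p", "td"):
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <span> and <b> add no boundary, so their text
        # is simply appended to the current line.
        self.parts.append(data)


def cell_text(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any text that follows the last closing tag
    return "\n".join(parser.lines)


print(cell_text("<td><p>Motor \n  stuck of <b>CL[2]</b> and <b>CL[3].</b></p></td>"))
```

Because the text is joined before it is collapsed, the spaces around <b> survive, and empty paragraphs produce no blank lines. To get an extra blank line between paragraphs, join the lines with "\n\n" instead.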
