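I am trying to extract text from an XHTML table as plain text, while preserving the line breaks that would be displayed if the document were rendered in an HTML renderer. I do not want to preserve the line breaks that exist only in the raw XML file.
The raw table cells contain a lot of extraneous whitespace that an HTML browser would not render, as well as <p></p> and <br /> tags (which obviously are rendered).
Here is an example of the kind of cell the source document contains: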
<td>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">INTERPOLATION QUEUE FULL
</span><span style="FONT-SIZE: 11pt"></span></p>
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">In
Interpolated position motion mode the set-point buffer is full. The last
received set-point is not interpolated.</span><span style="FONT-SIZE: 11pt"></span></p></td>
The text extracted from this cell should look like this:
INTERPOLATION QUEUE FULL
In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
Or like this (with an extra line between paragraphs):
INTERPOLATION QUEUE FULL

In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
When I use BeautifulSoup's .get_text(separator=' ', strip=True) method, whitespace within a text element of the XML that a browser would not render is retained in the output, like this:
INTERPOLATION QUEUE FULL In \n Interpolated position motion mode the set-point buffer is full. The last \n received set-point is not interpolated.
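As far as I understand it, strip=True trims only the leading and trailing whitespace of each text node, so newlines in the middle of a node survive. The following is a minimal sketch of that difference using plain strings rather than BeautifulSoup itself; the list of text nodes is a shortened, assumed approximation of what a parser would yield for the cell above.

```python
# Text nodes roughly as a parser would yield them for the sample cell
# (shortened; the exact node boundaries are an assumption).
text_nodes = [
    "\n",
    "INTERPOLATION QUEUE FULL\n",
    "\n",
    "In\nInterpolated position motion mode the set-point buffer is full.",
]

# Trimming only the ends of each node keeps the interior newline.
stripped = " ".join(s.strip() for s in text_nodes if s.strip())

# Splitting on any whitespace and re-joining collapses interior runs too.
collapsed = " ".join(" ".join(s.split()) for s in text_nodes if s.strip())

print(repr(stripped))
print(repr(collapsed))
```

Collapsing each node separately like this removes the stray newlines, but it also discards the paragraph boundaries and any space that sits at the edge of a node next to an inline tag.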
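When I use the more elaborate BeautifulSoup-based answer from this question, much of the unwanted whitespace disappears, but unrendered line breaks remain, for example between "In" and "Interpolated".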
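When I use Html2Text with its default settings, the unrendered whitespace is stripped as I want, but the <p> and <br /> tags present in the underlying HTML are ignored, and it inserts additional line breaks that do not exist in the HTML paragraphs.
A snippet showing my Html2Text usage: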
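import xml.etree.ElementTree as ET  # assumed; the question does not show which ET is imported
import html2text

h2t = html2text.HTML2Text()
h2t.ignore_emphasis = True  # do not emit emphasis markers for tags such as <b>

def element2html(element):
    # Serialize the parsed element back to a markup string.
    return ET.tostring(element, encoding='unicode', method='xml')

def get_text(element):
    # Convert the serialized element to text and trim the ends.
    html = element2html(element)
    return h2t.handle(html).strip()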
Example output from the code above:
INTERPOLATION QUEUE FULL In Interpolated position motion mode the set-point\nbuffer is full. The last received set-point is not interpolated.
I can suppress the inserted line breaks by configuring the Html2Text converter with body_width=0:
h2t = html2text.HTML2Text()
h2t.body_width=0
h2t.ignore_emphasis=True
[...]
But it still discards the <p> and <br /> layout information from the original HTML. Example output:
INTERPOLATION QUEUE FULL In Interpolated position motion mode the set-point buffer is full. The last received set-point is not interpolated.
How can I extract the text with whitespace handled the way a browser would handle it?
UPDATE:
Here is another example of XHTML taken verbatim from the source document. (This time I have not omitted the formatting attributes on the <td> tag.)
<td style="BORDER-TOP: medium none; HEIGHT: 13.5pt; BORDER-RIGHT: red 1pt solid; WIDTH: 205.55pt; BACKGROUND: white; BORDER-BOTTOM: red 1pt solid; PADDING-BOTTOM: 0pt; PADDING-TOP: 0pt; PADDING-LEFT: 5.4pt; BORDER-LEFT: red 1pt solid; PADDING-RIGHT: 5.4pt" valign="top" width="274">
<p class="TableText10pts"><span style="FONT-SIZE: 11pt; COLOR: black">Motor
stuck - the motor is powered but is not moving according to the definition
of <b>CL[2]</b> and <b>CL[3].</b></span><span style="FONT-SIZE: 11pt"></span></p></td>
I would like the extracted text to look like this (no line breaks):
Motor stuck - the motor is powered but is not moving according to the definition of CL[2] and CL[3].
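Dropping the <b> tags from the output is no problem at all, but running text = [" ".join(p.getText(strip=True).replace("\n", "").split()) for p in soup] on this input also removes the whitespace around the <b> tags.
So the actual output looks like this: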
Motor stuck - the motor is powered but is not moving according to the definition ofCL[2]andCL[3].
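One possible direction, offered as a sketch rather than a tested solution: walk the element tree, emit a line break only at <br /> and at block-level boundaries, and collapse whitespace only after the text of neighbouring inline nodes has been joined, so that spaces next to <b> are kept. The helper names rendered_text and local_name and the BLOCK_TAGS set are hypothetical, the standard-library xml.etree.ElementTree is assumed as the parser, and this covers only a small part of what a browser actually does.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these tags start and end a line; extend the set as needed.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

# Whitespace that HTML collapses (a non-breaking space is left alone).
WS = re.compile(r"[ \t\r\n\f]+")


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}p'."""
    return tag.rsplit("}", 1)[-1].lower()


def rendered_text(element):
    """Return the element's text with browser-like line breaks (hypothetical helper)."""
    parts = []

    def walk(node):
        name = local_name(node.tag)
        is_block = name in BLOCK_TAGS
        if name == "br" or is_block:
            parts.append("\n")
        # Source newlines and runs of spaces become one space.
        if node.text:
            parts.append(WS.sub(" ", node.text))
        for child in node:
            walk(child)
            if child.tail:
                parts.append(WS.sub(" ", child.tail))
        if is_block:
            parts.append("\n")

    walk(element)
    # Only the markers added above are newlines now; tidy each line.
    lines = (WS.sub(" ", line).strip(" ") for line in "".join(parts).split("\n"))
    return "\n".join(line for line in lines if line)


cell = ET.fromstring(
    "<td><p>Motor\nstuck - definition\nof <b>CL[2]</b> and <b>CL[3].</b></p>"
    "<p>second<br />line</p></td>"
)
print(rendered_text(cell))
```

Empty lines are dropped here, which gives the first of the two desired layouts; joining the lines with "\n\n" would not be enough for the second layout, because <br /> and paragraph breaks would then need to be tracked separately.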