I have an HTML file that contains XML at the bottom, wrapped in a comment. It looks like this:
<!DOCTYPE html>
<html>
<head>
***
</head>
<body>
<div class="panel panel-primary call__report-modal-panel">
<div class="panel-heading text-center custom-panel-heading">
<h2>Report</h2>
</div>
<div class="panel-body">
<div class="panel panel-default">
<div class="panel-heading">
<div class="panel-title">Info</div>
</div>
<div class="panel-body">
<table class="table table-bordered table-page-break-auto table-layout-fixed">
<tr>
<td class="col-sm-4">ID</td>
<td class="col-sm-8">1</td>
</tr>
</table>
</div>
</div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
The requirement is to parse the XML inside the comment in the HTML above. So far, I have tried reading the HTML file into a string and doing the following:
with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"
This leaves the XML fragment in the string d3, wrapped in three single quotes.
I then tried to read it with ElementTree:
ET.fromstring(d3)
But it fails with the following error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2
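One likely cause, offered as a guess rather than a confirmed diagnosis: calling str() on a bytes object returns its repr (text that begins with b' and contains literal backslash escapes), and concatenating ''' adds real quote characters to the data instead of delimiting it. The sketch below uses a made-up two-element document, not the file above, and a hypothetical helper parses() to show which variants ElementTree accepts.

```python
import xml.etree.ElementTree as ET

# Made-up sample bytes, standing in for what file.read() returns in 'rb' mode.
raw = b"<a>\n<b>1</b>\n</a>"

as_repr = str(raw)             # repr of the bytes object, not the decoded text
as_text = raw.decode("ascii")  # the actual text content

# Triple quotes are Python source syntax; concatenating them
# just prepends and appends three quote characters to the data.
wrapped = "'''" + as_text + "'''"


def parses(text):
    """Hypothetical helper: True if ElementTree can parse the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(as_repr[:2])       # first two characters of the repr
print(parses(as_text))
print(parses(wrapped))
print(parses(as_repr))
```

Only the decoded text parses; the repr and the quote-wrapped string both contain characters before the root element.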
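Basically, I need some help to:
- Read the HTML
- Extract the XML fragment that is commented out at the bottom of the HTML
- Take that string and pass it to ET.fromstring(). Because the function is receiving a string wrapped in triple quotes, it is not well-formed, which raises the error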
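The three steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive solution: it assumes the XML sits in the last HTML comment of the file, and the helper names extract_commented_xml and parse_fields are made up for illustration. An in-memory sample (a shortened version of the HTML above) stands in for the file so the block runs on its own.

```python
import re
import xml.etree.ElementTree as ET


def extract_commented_xml(html_text):
    """Return the body of the last <!-- ... --> comment, stripped of whitespace."""
    comments = re.findall(r"<!--(.*?)-->", html_text, flags=re.DOTALL)
    if not comments:
        raise ValueError("no HTML comment found")
    # strip() matters: the XML declaration must be the very first thing in the text.
    return comments[-1].strip()


def parse_fields(xml_text):
    """Map each <fieldname> to its <val> for every <mytag> element."""
    root = ET.fromstring(xml_text)
    return {
        tag.findtext("fieldname"): tag.findtext("val")
        for tag in root.iter("mytag")
    }


# Shortened stand-in for the HTML file shown in the question.
sample_html = """<!DOCTYPE html>
<html>
<body><h2>Report</h2></body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
"""

fields = parse_fields(extract_commented_xml(sample_html))
print(fields)
```

For the real file, the text could be read with something like open('my_html.html', encoding='cp1252').read() and passed to extract_commented_xml; the cp1252 encoding is an assumption based on the XML declaration, and opening the file in text mode avoids the str(bytes) step entirely.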