Python Parsel cannot access nested elements

Posted on March 3

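I'm working with Parsel. Unfortunately, I can't select an <a> tag that is a child of another <a> tag (I know that an <a> inside an <a> is not valid HTML). How can I handle this case with Parsel? I have already solved it with Beautiful Soup using html.parser as the backend (Beautiful Soup + lxml does not work).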

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </a>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the resulting parsel.selector.SelectorList is empty

If I put the <a> tags inside a <div>, everything works fine. Here is an example:

from parsel import Selector

html_text = '''
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <div>
        <a id="test" href='image1.html'>Name: My image 1 <br /></a>
        <a id="test" href='image2.html'>Name: My image 2 <br /></a>
        <a id="test" href='image3.html'>Name: My image 3 <br /></a>
        <a id="test" href='image4.html'>Name: My image 4 <br /></a>
        <a id="test" href='image5.html'>Name: My image 5 <br /></a>
    </div>
    </body>
    </html>
'''

selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the resulting parsel.selector.SelectorList is not empty

from parsel import Selector

html_text = """
<html>
    <head>
    <base href='http://example.com/' />
    <title>Example website</title>
    </head>
    <body>
    <a href="#">
        <a id="test" href='image1.html'>Name: My image 1 </a>
        <a id="test" href='image2.html'>Name: My image 2 </a>
        <a id="test" href='image3.html'>Name: My image 3 </a>
        <a id="test" href='image4.html'>Name: My image 4 </a>
        <a id="test" href='image5.html'>Name: My image 5 </a>
    </a>
    </body>
</html>
"""

selector = Selector(text=html_text, type="xml")

# print how the Parsel parses the document:
# print(selector.getall()[0])

print(selector.xpath("//a/a"))

[
    <Selector query='//a/a' data='<a id="test" href="image1.html">Name:...'>,
    <Selector query='//a/a' data='<a id="test" href="image2.html">Name:...'>,
    <Selector query='//a/a' data='<a id="test" href="image3.html">Name:...'>,
    <Selector query='//a/a' data='<a id="test" href="image4.html">Name:...'>,
    <Selector query='//a/a' data='<a id="test" href="image5.html">Name:...'>
]
