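I'm working with Parsel. Unfortunately, I can't parse an <a> tag that is a child of another <a> tag (I know that an <a> inside an <a> is not valid HTML). How can I handle this case with Parsel? I have already solved it with Beautiful Soup using html.parser as the backend (Beautiful Soup with lxml doesn't work).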
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</a>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//a/a'))  # the returned SelectorList is empty
If I put the <a> tags inside a <div>, everything works fine. Here is an example:
from parsel import Selector
html_text = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div>
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
<a id="test" href='image3.html'>Name: My image 3 <br /></a>
<a id="test" href='image4.html'>Name: My image 4 <br /></a>
<a id="test" href='image5.html'>Name: My image 5 <br /></a>
</div>
</body>
</html>
'''
selector = Selector(text=html_text)
print(selector.xpath('//div/a'))  # the returned SelectorList is not empty
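
For reference, here is a rough sketch of why an html.parser-based approach can still see the nested anchors. It is not the Beautiful Soup code I used. It relies only on the standard library's html.parser, which reports tags in document order without repairing invalid nesting. The class name NestedAnchorCollector and the shortened sample markup are made up for illustration. Note that it collects every <a> opened inside another <a> (closer to `//a//a` than to `//a/a`).

```python
from html.parser import HTMLParser


class NestedAnchorCollector(HTMLParser):
    """Collect attributes of <a> tags opened while another <a> is still open.

    Hypothetical helper for illustration; not part of Parsel or Beautiful Soup.
    """

    def __init__(self):
        super().__init__()
        self.open_anchors = 0  # number of <a> tags currently open
        self.nested = []       # attribute dicts of nested <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.open_anchors > 0:
                # An <a> is already open, so this one is nested inside it.
                self.nested.append(dict(attrs))
            self.open_anchors += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.open_anchors > 0:
            self.open_anchors -= 1


# Shortened version of the markup from the question (assumption: two items
# are enough to show the behaviour), plus one anchor that is not nested.
html_text = '''
<body>
<a href="#">
<a id="test" href='image1.html'>Name: My image 1 <br /></a>
<a id="test" href='image2.html'>Name: My image 2 <br /></a>
</a>
<a id="outside" href='other.html'>Not nested</a>
</body>
'''

collector = NestedAnchorCollector()
collector.feed(html_text)
collector.close()

nested_hrefs = [attrs.get("href") for attrs in collector.nested]
print(nested_hrefs)
```

Only the two anchors inside the outer <a> end up in nested_hrefs; the last anchor is skipped because no other <a> is open when it starts.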