from bs4 import BeautifulSoup
html = '''<tbody id="plaintiff-body">
<tr>
<td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td>
<td>JENEE BENNETT</td>
<td></td>
<td>COURTNEY L HANNA</td>
</tr>
<tr id="pladetail0001" style="" valign="top">
<td></td>
<td>2348 WOODBROOK CIR N<br>UNIT D<br>COLUMBUS, OH 43223</td>
<td></td>
<td>JOSEPH & JOSEPH CO LPA <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>JOSEPH & JOSEPH CO LPA <br>THE WATERFORD, SUITE 200 <br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>S K DODDERER<br>155 W MAIN STREET<br>#200<br>COLUMBUS, OH 43215<br>(614) 449-8282</td>
</tr>
</tbody>'''
soup = BeautifulSoup(html, 'lxml')
att = [x.get_text(strip=True, separator=' ') for x in soup.select(
'#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')]
print(att)
Current output:
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 DEBORAH L MCNINCH JOSEPH & JOSEPH CO LPA THE WATERFORD, SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 S K DODDERER 155 W MAIN STREET #200 COLUMBUS, OH 43215 (614) 449-8282']
Expected output:
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']
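How can I achieve that?
I was thinking of using https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function to pass a function and then loop over the matches, stopping the loop as soon as I find that a `br` is empty (that is, two `<br>` tags in a row).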
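A minimal sketch of that loop idea, assuming the first attorney block always ends at the first pair of consecutive `<br>` tags. The helper name `text_before_blank_line` is hypothetical, the sample markup is a shortened copy of the table above, and `html.parser` is used here only so the example has no extra dependency:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def text_before_blank_line(cell, separator=" "):
    """Hypothetical helper: join the cell's text up to the first <br><br> pair."""
    parts = []
    previous_was_br = False
    for node in cell.children:
        if isinstance(node, Tag) and node.name == "br":
            if previous_was_br:
                break  # second <br> in a row: the first block is finished
            previous_was_br = True
            continue
        # Text nodes are stripped directly; any other tag contributes its text.
        text = node.get_text(strip=True) if isinstance(node, Tag) else str(node).strip()
        if text:
            parts.append(text)
            previous_was_br = False
    return separator.join(parts)

# Shortened sample based on the table in the question.
sample = """<table><tbody id="plaintiff-body">
<tr><td></td><td>JENEE BENNETT</td><td></td><td>COURTNEY L HANNA</td></tr>
<tr><td></td><td>2348 WOODBROOK CIR N</td><td></td>
<td>JOSEPH &amp; JOSEPH CO LPA <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>155 W MAIN ST</td></tr>
</tbody></table>"""

soup = BeautifulSoup(sample, "html.parser")
name_cell = soup.select_one("#plaintiff-body tr:first-child > td:nth-child(4)")
detail_cell = soup.select_one("#plaintiff-body tr:nth-child(2) > td:last-child")
att = [name_cell.get_text(strip=True, separator=" "),
       text_before_blank_line(detail_cell)]
print(att)
```

The CSS selectors still pick the two cells; only the cut-off at the blank line is done in Python, since the text after a `<br>` is a text node rather than an element.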
Otherwise, I could take `x` instead of `x.get_text()`, split on `><` to get the first item, and then use https://w3lib.readthedocs.io/en/latest/w3lib.html?highlight=remove#w3lib.html.remove_tags to strip the remaining tags.
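A rough sketch of that string-splitting alternative, under a few assumptions: it splits the cell's inner HTML on a doubled `<br>` with a regular expression (rather than on the literal `><`, since Beautiful Soup may serialise the tag as `<br/>`), and it re-parses the first chunk with Beautiful Soup to drop the tags instead of calling `w3lib`. The sample cell is a shortened, illustrative copy of the one above:

```python
import re
from bs4 import BeautifulSoup

cell_html = ("<td>JOSEPH &amp; JOSEPH CO LPA <br>SUITE 200<br>(614) 449-8282"
             "<br><br>DEBORAH L MCNINCH<br>155 W MAIN ST</td>")
cell = BeautifulSoup(cell_html, "html.parser").td

# Serialise only the children of the cell, without the surrounding <td>.
inner = cell.decode_contents()

# Keep everything before the first pair of consecutive <br> tags,
# whether they are written as <br>, <br/> or <br />.
first_chunk = re.split(r"(?:<br\s*/?>\s*){2}", inner, maxsplit=1)[0]

# Re-parse the chunk to remove the remaining single <br> tags.
first_text = BeautifulSoup(first_chunk, "html.parser").get_text(strip=True, separator=" ")
print(first_text)
```

This round-trips through a string, so it is likely more fragile than walking the parsed children, but it stays close to the split-then-strip idea described above.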
It would be good to know whether there is a direct `CSS` solution, or otherwise a simple one.