Python: How to scrape a specific element with a specific id in BeautifulSoup

Posted on May 11

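I am trying to extract a table from Baseball Reference: https://www.baseball-reference.com/players/b/bondsba01.shtml. The table I want is the one with id="batting_value", but when I try to print what I have extracted, the program returns an empty list. Any information or help is welcome, thank you!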

from bs4 import BeautifulSoup
from urllib.request import urlopen

root_page = "https://www.baseball-reference.com/players/b/bondsba01.shtml"
soup = BeautifulSoup(urlopen(root_page), features = 'lxml')

table = soup.find('table', id = 'batting_value')
print(table)

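I also tried printing the <div> with id="div_batting_value" that contains the table, but that does not work either. However, I can successfully print other <div> elements with different ids.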

Recommended answer

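The main issue here is that the table is hidden inside HTML comments, so you have to pull it out of them before BeautifulSoup can find it. In my opinion, the simplest solution in this case is to strip the comment delimiters from the response text: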

.replace('<!--','').replace('-->','')

Another option is to be more specific and use bs4.Comment.
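
A minimal sketch of what the bs4.Comment route could look like, run against a small inline sample rather than the live page. The helper name find_in_comments and the sample markup are illustrative assumptions, not part of the original answer; the idea is simply to collect the comment nodes and re-parse each one as its own HTML fragment.

```python
from bs4 import BeautifulSoup, Comment


def find_in_comments(html, selector, parser="html.parser"):
    """Return the first element matching `selector`, looking in the
    regular markup first and then inside each HTML comment."""
    soup = BeautifulSoup(html, parser)
    found = soup.select_one(selector)
    if found is not None:
        return found
    # Comments are a special kind of text node, so filter text nodes by type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment body as a separate HTML fragment.
        fragment = BeautifulSoup(str(comment), parser)
        found = fragment.select_one(selector)
        if found is not None:
            return found
    return None


# Hypothetical markup standing in for the real page: the table sits in a comment.
sample = """
<div id="div_batting_value">
<!--
<table id="batting_value"><tr><th>Year</th></tr><tr><td>1986</td></tr></table>
-->
</div>
"""
table = find_in_comments(sample, "#batting_value")
print(table["id"] if table is not None else None)
```

For the real page, the response text of the request would be passed as html. Unlike the blanket replace, this leaves unrelated comments untouched.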

Example

import requests
from bs4 import BeautifulSoup

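url = 'https://www.baseball-reference.com/players/b/bondsba01.shtml'
# Strip the comment delimiters so the commented-out table becomes regular markup
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Name the parser explicitly (lxml, as in the question) to avoid the bs4 warning
soup = BeautifulSoup(html, 'lxml')
soup.select_one('#batting_value')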

Or use it together with pandas.read_html():

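from io import StringIO
import requests
import pandas as pd

html = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml').text.replace('<!--','').replace('-->','')
# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(html), attrs={'id':'batting_value'})[0]
# Keep rows that have a league value and are not repeated header rows
df[(~df.Lg.isna()) & (df.Lg != 'Lg')]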

The result:

	Year	Age	Tm	Lg	G	PA	Rbat	Rbaser	Rdp	Rfield	Rpos	RAA	WAA	Rrep	RAR	WAR	waaWL%	162WL%	oWAR	dWAR	oRAR	Salary	Pos	Awards
0	1986	21	PIT	NL	113	484	3	5	0	8	1	17	1.9	16	34	3.5	0.517	0.512	2.6	1	25	$60,000	*8/H	RoY-6
1	1987	22	PIT	NL	150	611	11	3	1	24	-3	36	3.7	21	57	5.8	0.525	0.523	3.2	2.1	33	$100,000	*78H/9	nan
...
20	2006	41	SFG	NL	130	493	30	1	0	1	-4	27	2.5	15	42	4	0.52	0.516	3.9	-0.4	41	$19,331,470	*7H/D	nan
21	2007	42	SFG	NL	126	477	37	-1	-1	-10	-4	21	2	15	36	3.4	0.516	0.513	4.4	-1.5	46	$15,533,970	*7H/D	AS