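The main problem here is that the table is hidden inside HTML comments, so you have to pull it out of them before BeautifulSoup can find it. In my opinion, the simplest solution in this case is to strip the comment delimiters:
.replace('<!--','').replace('-->','')
An alternative is to be more specific and work with bs4.Comment.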
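The bs4.Comment route could look roughly like the sketch below. It is only an illustration: the inline `html` string and the `all_batting_value` wrapper id are made-up stand-ins for the downloaded page, so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table sits inside an HTML comment.
html = """
<div id="all_batting_value">
<!--
<table id="batting_value">
  <tr><th>Year</th><th>Tm</th></tr>
  <tr><td>1986</td><td>PIT</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# While the table is commented out, a normal selector cannot see it.
print(soup.select_one("#batting_value"))  # None

table = None
# Walk every comment node and parse its text as a separate document.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    table = inner.select_one("#batting_value")
    if table is not None:
        break

cells = [td.get_text() for td in table.select("td")]
print(cells)  # ['1986', 'PIT']
```

Compared with the blanket `.replace()` call, this only touches comments that actually contain the table you are after, at the cost of a second parsing pass.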
Example
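import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/players/b/bondsba01.shtml'
# Strip the comment delimiters so the hidden tables become regular markup
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
soup.select_one('#batting_value')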
Or combine it with pandas.read_html():
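from io import StringIO
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/players/b/bondsba01.shtml'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Wrap in StringIO: passing a literal HTML string to read_html is deprecated since pandas 2.1
df = pd.read_html(StringIO(html), attrs={'id': 'batting_value'})[0]
df[(~df.Lg.isna()) & (df.Lg != 'Lg')]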
Result:
| | Year | Age | Tm | Lg | G | PA | Rbat | Rbaser | Rdp | Rfield | Rpos | RAA | WAA | Rrep | RAR | WAR | waaWL% | 162WL% | oWAR | dWAR | oRAR | Salary | Pos | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1986 | 21 | PIT | NL | 113 | 484 | 3 | 5 | 0 | 8 | 1 | 17 | 1.9 | 16 | 34 | 3.5 | 0.517 | 0.512 | 2.6 | 1 | 25 | $60,000 | *8/H | RoY-6 |
| 1 | 1987 | 22 | PIT | NL | 150 | 611 | 11 | 3 | 1 | 24 | -3 | 36 | 3.7 | 21 | 57 | 5.8 | 0.525 | 0.523 | 3.2 | 2.1 | 33 | $100,000 | *78H/9 | nan |
| ... | | | | | | | | | | | | | | | | | | | | | | | | |
| 20 | 2006 | 41 | SFG | NL | 130 | 493 | 30 | 1 | 0 | 1 | -4 | 27 | 2.5 | 15 | 42 | 4 | 0.52 | 0.516 | 3.9 | -0.4 | 41 | $19,331,470 | *7H/D | nan |
| 21 | 2007 | 42 | SFG | NL | 126 | 477 | 37 | -1 | -1 | -10 | -4 | 21 | 2 | 15 | 36 | 3.4 | 0.516 | 0.513 | 4.4 | -1.5 | 46 | $15,533,970 | *7H/D | AS |
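The final filter in the pandas snippet presumably exists to drop rows that are not individual seasons: header rows repeated inside the table body (where `Lg` is literally `'Lg'`) and summary rows with no league value. A small sketch on toy data shows the effect; the rows below are invented for illustration and are not taken from the real page.

```python
import pandas as pd

# Toy data, not real output: one season row, one repeated header row,
# and one made-up summary row with no league value.
df = pd.DataFrame(
    {
        "Year": ["1986", "Year", "Total"],
        "Lg": ["NL", "Lg", None],
        "G": ["113", "G", "999"],
    }
)

# Keep rows whose Lg is present and is not the literal header text "Lg".
seasons = df[df["Lg"].notna() & (df["Lg"] != "Lg")]
print(seasons["Year"].tolist())  # ['1986']
```

`df["Lg"].notna()` is equivalent to the `~df.Lg.isna()` used in the answer; it is just slightly easier to read.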