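I'm trying to scrape MLB daily lineup information from here: https://www.rotowire.com/baseball/daily-lineups.php
I'm using Python with requests, BeautifulSoup, and pandas.
My end goal is to end up with two pandas DataFrames.
The first is a starting pitchers DataFrame: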
date | game_time | pitcher_name | team | lineup_throws |
---|---|---|---|---|
2024-03-29 | 1:40 PM ET | Spencer Strider | ATL | R |
2024-03-29 | 1:40 PM ET | Zack Wheeler | PHI | R |
The second is a starting batters DataFrame:
date | game_time | batter_name | team | pos | batting_order | lineup_bats |
---|---|---|---|---|---|---|
2024-03-29 | 1:40 PM ET | Ronald Acuna | ATL | RF | 1 | R |
2024-03-29 | 1:40 PM ET | Ozzie Albies | ATL | 2B | 2 | S |
2024-03-29 | 1:40 PM ET | Austin Riley | ATL | 3B | 3 | R |
2024-03-29 | 1:40 PM ET | Kyle Schwarber | PHI | DH | 1 | L |
2024-03-29 | 1:40 PM ET | Trea Turner | PHI | SS | 2 | R |
2024-03-29 | 1:40 PM ET | Bryce Harper | PHI | 1B | 3 | L |
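Each DataFrame would cover all games for a given day.
I tried to adapt this answer to my needs but can't get it to work well: Scraping Web data using BeautifulSoup
Any help or guidance is greatly appreciated.
Here is the code from that link, which I've been trying to adapt without making progress: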
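To make the target shape concrete, here is a minimal sketch that builds both DataFrames by hand from a few of the sample rows above. The variable names (`pitcher_rows`, `batter_rows`, `pitchers_df`, `batters_df`) are only illustrative; the column names are the ones from the tables.

```python
import pandas as pd

# A few sample rows copied from the tables above (not scraped data).
pitcher_rows = [
    {"date": "2024-03-29", "game_time": "1:40 PM ET",
     "pitcher_name": "Spencer Strider", "team": "ATL", "lineup_throws": "R"},
    {"date": "2024-03-29", "game_time": "1:40 PM ET",
     "pitcher_name": "Zack Wheeler", "team": "PHI", "lineup_throws": "R"},
]
batter_rows = [
    {"date": "2024-03-29", "game_time": "1:40 PM ET", "batter_name": "Ronald Acuna",
     "team": "ATL", "pos": "RF", "batting_order": 1, "lineup_bats": "R"},
    {"date": "2024-03-29", "game_time": "1:40 PM ET", "batter_name": "Kyle Schwarber",
     "team": "PHI", "pos": "DH", "batting_order": 1, "lineup_bats": "L"},
]

# A list of dicts becomes one row per dict, with columns in key order.
pitchers_df = pd.DataFrame(pitcher_rows)
batters_df = pd.DataFrame(batter_rows)

print(pitchers_df)
print(batters_df)
```

So the scraper only needs to produce one dict per pitcher and one dict per batter; pandas handles the rest.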
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []
# Each .lineup__bottom block holds the weather forecast for one game.
for tag in soup.select(".lineup__bottom"):
    # The matchup header sits in the preceding .lineup__teams element.
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    rain = tag.select_one(".lineup__weather-text > b")
    if rain is None:
        # Skip games where the weather element is missing.
        continue
    # The text right after the <b> tag is split into temperature and wind.
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]
    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )

df = pd.DataFrame(weather)
print(df)
The information I want seems to be contained in lineup__main rather than lineup__bottom.
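As a rough sketch of what walking lineup__main could look like, the snippet below parses a small hand-written HTML sample rather than the live page. Everything about the markup is an assumption: apart from `lineup__main` and `lineup__teams`, the class names (`lineup__time`, `lineup__team`, `lineup__abbr`, `lineup__list`, `is-visit`, `is-home`, `lineup__player`, `lineup__player-highlight-name`, `lineup__pos`, `lineup__bats`, `lineup__throws`) are guesses that would need to be checked against the real page source, and `parse_lineups` and its `date` parameter are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample for one game. The structure and class names are
# assumptions, not a copy of the real page.
SAMPLE_HTML = """
<div class="lineup is-mlb">
  <div class="lineup__time">1:40 PM ET</div>
  <div class="lineup__teams">
    <div class="lineup__team is-visit"><div class="lineup__abbr">ATL</div></div>
    <div class="lineup__team is-home"><div class="lineup__abbr">PHI</div></div>
  </div>
  <div class="lineup__main">
    <ul class="lineup__list is-visit">
      <li class="lineup__player-highlight">
        <div class="lineup__player-highlight-name">
          <a>Spencer Strider</a> <span class="lineup__throws">R</span>
        </div>
      </li>
      <li class="lineup__player"><div class="lineup__pos">RF</div>
        <a>Ronald Acuna</a> <span class="lineup__bats">R</span></li>
      <li class="lineup__player"><div class="lineup__pos">2B</div>
        <a>Ozzie Albies</a> <span class="lineup__bats">S</span></li>
    </ul>
    <ul class="lineup__list is-home">
      <li class="lineup__player-highlight">
        <div class="lineup__player-highlight-name">
          <a>Zack Wheeler</a> <span class="lineup__throws">R</span>
        </div>
      </li>
      <li class="lineup__player"><div class="lineup__pos">DH</div>
        <a>Kyle Schwarber</a> <span class="lineup__bats">L</span></li>
      <li class="lineup__player"><div class="lineup__pos">SS</div>
        <a>Trea Turner</a> <span class="lineup__bats">R</span></li>
    </ul>
  </div>
</div>
"""


def parse_lineups(html, date):
    """Return (pitchers, batters) as lists of dicts, one dict per player."""
    soup = BeautifulSoup(html, "html.parser")
    pitchers, batters = [], []
    for game in soup.select("div.lineup"):
        main = game.select_one(".lineup__main")
        if main is None:
            continue  # not a game box with a lineup
        game_time = game.select_one(".lineup__time").get_text(strip=True)
        for side in ("is-visit", "is-home"):
            team = game.select_one(f".lineup__team.{side} .lineup__abbr").get_text(strip=True)
            lineup = main.select_one(f".lineup__list.{side}")
            # Starting pitcher: assumed to be the highlighted entry.
            p = lineup.select_one(".lineup__player-highlight-name")
            pitchers.append({
                "date": date, "game_time": game_time,
                "pitcher_name": p.a.get_text(strip=True), "team": team,
                "lineup_throws": p.select_one(".lineup__throws").get_text(strip=True),
            })
            # Batters: list position is taken as the batting order.
            for order, li in enumerate(lineup.select("li.lineup__player"), start=1):
                batters.append({
                    "date": date, "game_time": game_time,
                    "batter_name": li.a.get_text(strip=True), "team": team,
                    "pos": li.select_one(".lineup__pos").get_text(strip=True),
                    "batting_order": order,
                    "lineup_bats": li.select_one(".lineup__bats").get_text(strip=True),
                })
    return pitchers, batters


pitcher_rows, batter_rows = parse_lineups(SAMPLE_HTML, date="2024-03-29")
pitchers_df = pd.DataFrame(pitcher_rows)
batters_df = pd.DataFrame(batter_rows)
print(pitchers_df)
print(batters_df)
```

If the real markup matches these assumptions, replacing `SAMPLE_HTML` with `requests.get(url).content` should be the only change needed; if not, the selectors are the part to adjust.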