I want to scrape https://www.rotowire.com/baseball/news.php, which contains news about MLB players, and save the data in a tabular format like this:

date  player           headline                news
4/17  Abner Uribe      Picks up second win     Uribe (2-1) earned the win Wednesday against the Padres after he allowed a hit and no walks in a scoreless eighth inning. He had one strikeout.
4/17  Richie Palacios  Gets day off vs. lefty  Palacios is out of the lineup for Wednesday's game against the Angels.

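I'm having trouble understanding how to isolate each news item into its own row and put it into a dataframe. I'm looking for any help to achieve this. Ideally, I would scrape every 5 minutes and keep the table growing.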

Recommended answer

To get all the information from that page into a dataframe, you can use the following example:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/news.php"

# Download the page and parse the HTML
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
# Each news item on the page sits in an element with the class "news-update"
for n in soup.select(".news-update"):
    name = n.a.text  # the first link in the item holds the player name
    h = n.select_one(".news-update__headline").text
    dt = n.select_one(".news-update__timestamp").text
    news = n.select_one(".news-update__news").text
    all_data.append({"Name": name, "Headline": h, "Date": dt, "News": news})

# One dictionary per news item becomes one row of the dataframe
df = pd.DataFrame(all_data)
print(df.head())

Output:

               Name                            Headline            Date                                                                                                                                                News
0       Joe Jacques              Recalled from Triple-A  April 17, 2024                                 Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.
1    Cedric Mullins                     Walks off Twins  April 17, 2024                                                Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.
2  Garrett Whitlock               Lands on injured list  April 17, 2024    Whitlock was placed on the 15-day injured list by the Red Sox on Wednesday with a left oblique strain, Mac Cerullo of the Boston Herald reports.
3        Eli Morgan  Shelved with shoulder inflammation  April 17, 2024  The Guardians placed Morgan on the 15-day injured list Wednesday with right shoulder inflammation, Joe Noga of The Cleveland Plain Dealer reports.
4     Craig Kimbrel                     Earns third win  April 17, 2024      Kimbel (3-0) earned the win Wednesday against the Twins after he retired all three batters he faced in the ninth inning. He had one strikeout.
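
The question asks for the columns date, player, headline, news with short dates such as 4/17, while the output above uses Name, Headline, Date, News with dates such as "April 17, 2024". A minimal sketch of the conversion follows, assuming the timestamp text always has the "Month day, year" form seen in the output above. The function name to_requested_row is hypothetical, and it works on the dictionaries in all_data before they are passed to pd.DataFrame.

```python
from datetime import datetime


def to_requested_row(record):
    """Convert one scraped record to the date/player/headline/news layout.

    Assumes record["Date"] looks like "April 17, 2024" (as in the sample
    output); any other format raises ValueError.
    """
    parsed = datetime.strptime(record["Date"].strip(), "%B %d, %Y")
    return {
        "date": f"{parsed.month}/{parsed.day}",  # e.g. 4/17, no zero padding
        "player": record["Name"].strip(),
        "headline": record["Headline"].strip(),
        "news": record["News"].strip(),
    }


# Example record copied from the output above
sample = {
    "Name": "Joe Jacques",
    "Headline": "Recalled from Triple-A",
    "Date": "April 17, 2024",
    "News": "Jacques was recalled from Triple-A Worcester by the Red Sox "
            "on Wednesday, Mac Cerullo of the Boston Herald reports.",
}
row = to_requested_row(sample)
print(row["date"], row["player"], row["headline"])
```

With this in place, pd.DataFrame([to_requested_row(r) for r in all_data]) would give the table in the layout from the question.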

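Note: I recommend putting all this information into an SQL database (for example SQLite, which is included with Python), taking care not to insert any duplicates, and setting up a cron job to run the script every 5 minutes.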
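
The note above can be sketched with the standard sqlite3 module. This is one possible approach, not the only one: the table name news, the function name save_news, and the choice of (name, headline, date) as the key that identifies a duplicate are all assumptions made for illustration. The records are dictionaries with the same keys as the entries in all_data.

```python
import sqlite3

# Assumed schema: an item counts as a duplicate when name, headline and date
# all match an already stored row.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    name     TEXT NOT NULL,
    headline TEXT NOT NULL,
    date     TEXT NOT NULL,
    news     TEXT NOT NULL,
    UNIQUE (name, headline, date)
)
"""


def save_news(conn, records):
    """Insert scraped records, skipping stored ones; return the new-row count."""
    conn.execute(SCHEMA)
    before = conn.total_changes
    # INSERT OR IGNORE silently skips rows that violate the UNIQUE constraint
    conn.executemany(
        "INSERT OR IGNORE INTO news (name, headline, date, news) "
        "VALUES (:Name, :Headline, :Date, :News)",
        records,
    )
    conn.commit()
    return conn.total_changes - before


# Demo with an in-memory database; a real run would pass a file path instead
conn = sqlite3.connect(":memory:")
records = [
    {"Name": "Joe Jacques", "Headline": "Recalled from Triple-A",
     "Date": "April 17, 2024", "News": "Jacques was recalled from Triple-A Worcester."},
    {"Name": "Cedric Mullins", "Headline": "Walks off Twins",
     "Date": "April 17, 2024", "News": "Mullins went 1-for-4 with a walk-off home run."},
]
print(save_news(conn, records))  # 2: both rows are new
print(save_news(conn, records))  # 0: the same rows are ignored on a second run
```

Calling save_news(conn, all_data) at the end of the scraping script keeps the table growing without repeats. A crontab entry of the form */5 * * * * /usr/bin/python3 /path/to/scrape_news.py (both paths are placeholders) would then run it every 5 minutes.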
