I am trying to scrape the names and links of the free courses from Harvard University's website at https://pll.harvard.edu/catalog/free, using Python and BeautifulSoup.

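I have managed to get the name of each course, but I am having trouble extracting the link to the course. For example, for CS50's Introduction to Game Development, this is the HTML for the link to the course page: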

<div class="field field--name-title field--type-string field--label-hidden field__items">
        <h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>
  </div>

I am trying to get the "/course/cs50s-introduction-game-development" part from the href attribute of the a tag, for every course listed on the page.

This is the code I currently have for getting the course names:

import requests
from bs4 import BeautifulSoup

#gets the soup of the given url
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

#gets the name and skill from each course and prints them
def get_all_content(firstURL):
    url = firstURL
    for i in range(1, 10):
        soup = get_data(url)
        print("PRINTING PAGE: " + str(url))
        course_names = soup.findAll("h3", attrs={"class": "field__item"})
        skills = soup.findAll("div", attrs={"class": "field field--name-title field--type-string field--label-hidden field__items"})

        for course, skill in zip(course_names, skills):
            print(course.text + "\n" + skill.text)
        url = f'http://pll.harvard.edu/catalog/free?page={i}'

        print("PRINTING NEWLY GOTTEN URL: " + str(url))
        #sleep(randint(2, 10))


firstURL = 'https://pll.harvard.edu/catalog/free'  # catalog URL given at the top of the question
get_all_content(firstURL)

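I tried the following approaches to get the href, based on what I found online and through research. The closest I got was adding the following inside the first for loop above: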

 for div in soup.findAll("div", attrs={"class": "field field--name-title field--type-string field--label-hidden field__items"}):
      page_link = div.findAll("h3", attrs={"class":"field__item"})
      print("PRINTING LINK: " + str(page_link))

For the first course it printed the following: PRINTING LINK: [<h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>]

All I want is this part: "/course/cs50s-introduction-game-development"

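I have tried many solutions I found online, but I keep getting errors saying that find_all cannot be used here, or that None has no find attribute, or something similar. I am very new to Python (I started this week) and I am not sure how to take this further. The syntax is very confusing; I have researched it and got to this stage. I have narrowed it down to the h3, which is very close to what I am looking for. I feel the solution is really simple, but I have been working on it for two days straight without success. I would appreciate your help. How do I extract the href from the a tag inside the h3 inside the div?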

Recommended answer

Here is one way to get that data:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in range(9):
    try:
        r = s.get(f'https://pll.harvard.edu/catalog/free?page={x}')
        # parse each page once and reuse the soup
        soup = bs(r.text, 'html.parser')
        # "next" entry of the pager; computed here but not used further below
        is_next_page = soup.select_one('li[class="pager__item pager__item--next pagination-next"]')
        courses = soup.select('div[class="views-row"]')
        for course in courses:
            # the <h3> holds the course title; the <a> inside it holds the relative link
            course_name = course.select_one('h3[class="field__item"]').get_text(strip=True, separator=' ')
            course_url = 'https://pll.harvard.edu' + course.select_one('h3[class="field__item"] a').get('href')
            big_list.append((course_name, course_url))
    except Exception as e:
        print(e)
        break
# set() drops duplicate (title, url) pairs, so the row order is not guaranteed
df = pd.DataFrame(set(big_list), columns = ['title', 'url'])
print(df)

Result in the terminal:

    title   url
0   CS50's Understanding Technology     https://pll.harvard.edu/course/cs50s-understanding-technology
1   Global News & Technology Leadership in Challenging Times    https://pll.harvard.edu/course/global-news-technology-leadership-challenging-times
2   China’s First Empires and the Rise of Buddhism  https://pll.harvard.edu/course/chinas-first-empires-and-the-rise-of-buddhism
3   Deploying TinyML    https://pll.harvard.edu/course/deploying-tinyml
4   Science & Cooking: From Haute Cuisine to Soft Matter Science (chemistry)    https://pll.harvard.edu/course/science-and-cooking
...     ...     ...
123     PredictionX: John Snow and the Cholera Epidemic of 1854     https://pll.harvard.edu/course/predictionx-john-snow-and-cholera-epidemic-1854
124     CS50's Computer Science for Business Professionals  https://pll.harvard.edu/course/cs50s-computer-science-business-professionals
125     CS50's Understanding Technology     https://pll.harvard.edu/course/cs50s-understanding-technology-0/2023-05
126     Innovating in Health Care   https://pll.harvard.edu/course/innovating-health-care
127     Data Science: Productivity Tools    https://pll.harvard.edu/course/data-science-productivity-tools/2023-10

128 rows × 2 columns
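
The key step, reading the href from the a inside each h3, can be tried offline on the HTML fragment from the question. This is only a minimal sketch: the names html, BASE_URL and results are made up for illustration, and urllib.parse.urljoin is swapped in for the string concatenation used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# HTML fragment copied from the question
html = """
<div class="field field--name-title field--type-string field--label-hidden field__items">
        <h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>
  </div>
"""

BASE_URL = "https://pll.harvard.edu"  # site root, used to build absolute URLs

soup = BeautifulSoup(html, "html.parser")

results = []
# select every <a> that sits inside an <h3 class="field__item">
for a in soup.select("h3.field__item a"):
    name = a.get_text(strip=True)   # visible link text, i.e. the course title
    href = a.get("href")            # value of the href attribute (relative path)
    results.append((name, href, urljoin(BASE_URL, href)))

for name, href, full_url in results:
    print(name, href, full_url)
```

Selecting the a tag directly, rather than the surrounding div or h3, gives both the title and the link from one element.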

For details, see the documentation for requests and BeautifulSoup.
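
The errors mentioned in the question probably come from two BeautifulSoup behaviours: find_all returns a list-like ResultSet, which has no find or get of its own, and find returns None when nothing matches. The sketch below shows both on a small made-up fragment; the HTML and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# hypothetical fragment, shaped like the course markup in the question
html = '<div class="box"><h3 class="field__item"><a href="/course/example">Example</a></h3></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet (list-like), not a single tag
h3_list = soup.find_all("h3", class_="field__item")

# h3_list.find("a") would raise AttributeError; call find on each element instead
hrefs = []
for h3 in h3_list:
    a = h3.find("a")  # None when the <h3> contains no <a>
    if a is not None and a.has_attr("href"):
        hrefs.append(a["href"])

# find returns None when nothing matches, so check before chaining calls
missing = soup.find("h4")

print(hrefs)
print(missing)
```

Checking for None before calling a method on the result avoids the "NoneType has no attribute" class of errors.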
