HTML: How to extract the href and text from an attribute inside a div

Posted on July 22

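I am trying to scrape the names and links of the free courses from the Harvard website at https://pll.harvard.edu/catalog/free, using Python and BeautifulSoup.

I already have the name of each course, but I run into problems when I try to extract the link to the course. For example, for CS50's Introduction to Game Development, this is the HTML for the link to the course page: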

<div class="field field--name-title field--type-string field--label-hidden field__items">
        <h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>
  </div>

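I am trying to get the "/course/cs50s-introduction-game-development" part, which is the href attribute of the a tag, for every course listed on the page.

This is the code I currently have for getting the course names: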

import requests
from bs4 import BeautifulSoup

#gets the soup of the given url
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

#gets the name and skill from each course and prints them
def get_all_content(firstURL):
    url = firstURL
    for i in range(1, 10):
        soup = get_data(url)
        print("PRINTING PAGE: " + str(url))
        course_names = soup.findAll("h3", attrs={"class": "field__item"})
        skills = soup.findAll("div", attrs={"class": "field field--name-title field--type-string field--label-hidden field__items"})

        for course, skill in zip(course_names, skills):
            print(course.text + "\n" + skill.text)
        url = f'http://pll.harvard.edu/catalog/free?page={i}'

        print("PRINTING NEWLY GOTTTEN URL: " + str(url))
        #sleep(randint(2, 10))


firstURL = 'https://pll.harvard.edu/catalog/free'
get_all_content(firstURL)

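I tried the following to get the href, based on solutions I found online and through my own research. The closest I got was by adding this inside the first for loop above: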

for div in soup.findAll("div", attrs={"class": "field field--name-title field--type-string field--label-hidden field__items"}):
    page_link = div.findAll("h3", attrs={"class": "field__item"})
    print("PRINTING LINK: " + str(page_link))

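For the first course, it printed the following:

PRINTING LINK: [<h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>]

What I expected was just this part: "/course/cs50s-introduction-game-development"

I tried many solutions I found online, but I kept getting errors, such as that find_all cannot be used here, or that None has no find attribute, or something similar. I am very new to Python (I started this week) and I am not sure how to take this further. The syntax is very confusing, and I only got this far through research. I have narrowed it down to the h3, which is very close to what I am looking for. I feel the solution must be simple, but I have been working on it for two days straight without success. I would appreciate your help. How do I extract the href from the a tag inside the h3 inside the div?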

Recommended answer

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in range(9):
    try:
        r = s.get(f'https://pll.harvard.edu/catalog/free?page={x}')
        is_next_page = bs(r.text, 'html.parser').select_one('li[class="pager__item pager__item--next pagination-next"]')
        courses = bs(r.text, 'html.parser').select('div[class="views-row"]')
        for course in courses:
            course_name = course.select_one('h3[class="field__item"]').get_text(strip=True, separator=' ')
            course_url = 'https://pll.harvard.edu' + course.select_one('h3[class="field__item"] a').get('href')
            big_list.append((course_name, course_url))
    except Exception as e:
        print(e)
        break

df = pd.DataFrame(set(big_list), columns = ['title', 'url'])
print(df)

     title                                                                     url
0    CS50's Understanding Technology                                           https://pll.harvard.edu/course/cs50s-understanding-technology
1    Global News & Technology Leadership in Challenging Times                  https://pll.harvard.edu/course/global-news-technology-leadership-challenging-times
2    China’s First Empires and the Rise of Buddhism                            https://pll.harvard.edu/course/chinas-first-empires-and-the-rise-of-buddhism
3    Deploying TinyML                                                          https://pll.harvard.edu/course/deploying-tinyml
4    Science & Cooking: From Haute Cuisine to Soft Matter Science (chemistry)  https://pll.harvard.edu/course/science-and-cooking
...  ...                                                                       ...
123  PredictionX: John Snow and the Cholera Epidemic of 1854                   https://pll.harvard.edu/course/predictionx-john-snow-and-cholera-epidemic-1854
124  CS50's Computer Science for Business Professionals                        https://pll.harvard.edu/course/cs50s-computer-science-business-professionals
125  CS50's Understanding Technology                                           https://pll.harvard.edu/course/cs50s-understanding-technology-0/2023-05
126  Innovating in Health Care                                                 https://pll.harvard.edu/course/innovating-health-care
127  Data Science: Productivity Tools                                          https://pll.harvard.edu/course/data-science-productivity-tools/2023-10

128 rows × 2 columns
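For readers who prefer the find/find_all style used in the question, here is a minimal sketch of the same idea, run against the HTML fragment quoted in the question rather than the live site. The names extract_courses, BASE_URL and SAMPLE_HTML are illustrative assumptions, not part of the original answer, and urljoin from the standard library is swapped in for the string concatenation used above. The errors described in the question typically arise because find_all returns a list-like ResultSet (which must be looped over) and find returns None when nothing matches, so the sketch guards against both.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: BASE_URL and SAMPLE_HTML are illustrative. The fragment is the
# one quoted in the question, so no network request is made here.
BASE_URL = "https://pll.harvard.edu"
SAMPLE_HTML = """
<div class="field field--name-title field--type-string field--label-hidden field__items">
    <h3 class="field__item"><a href="/course/cs50s-introduction-game-development" hreflang="en">CS50's Introduction to Game Development</a></h3>
</div>
"""


def extract_courses(html, base_url=BASE_URL):
    """Return a list of (title, absolute_url) tuples for each course link in html."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    # find_all returns a list-like ResultSet, so iterate over it instead of
    # calling find or find_all on the result directly.
    for h3 in soup.find_all("h3", class_="field__item"):
        # find returns None when there is no matching tag, so check first.
        link = h3.find("a", href=True)
        if link is None:
            continue
        title = link.get_text(strip=True)
        # link["href"] is the raw relative path, e.g. "/course/...";
        # urljoin turns it into an absolute URL.
        courses.append((title, urljoin(base_url, link["href"])))
    return courses


if __name__ == "__main__":
    for title, url in extract_courses(SAMPLE_HTML):
        print(title, url)
```

If only the relative path is wanted, as in the question, link["href"] can be appended on its own instead of the urljoin result.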
