Web scraping with Selenium and BeautifulSoup in Python takes too long

Posted on May 13

I scrape the product links (<a href="">) from the Products page and store them in a list called hrefs.

import os

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
# chromedriver.exe is expected in the current working directory
service = Service(executable_path=os.path.join(os.getcwd(), "chromedriver.exe"))
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.set_page_load_timeout(900)

link = 'https://www.catch.com.au/seller/vdoo/products.html?page=1'
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'lxml')
product_links = soup.find_all("a", class_="css-1k3ukvl")

hrefs = []
for product_link in product_links:
    href = product_link.get("href")
    if not href:
        continue  # skip anchors without an href attribute
    if href.startswith("/"):
        # turn site-relative paths into absolute URLs
        href = "https://www.catch.com.au" + href
    hrefs.append(href)
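
The link clean-up step above can also be sketched with the standard library's urljoin, which handles both relative and already-absolute hrefs. This is a minimal sketch, not part of the original code: absolute_links is a hypothetical helper name, and dropping duplicate links is an added assumption about what is wanted.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.catch.com.au"


def absolute_links(raw_hrefs, base=BASE_URL):
    """Return absolute URLs in input order, skipping empty hrefs and duplicates.

    Hypothetical helper for illustration; it is not part of the original code.
    """
    seen = set()
    links = []
    for href in raw_hrefs:
        if not href:  # anchors without an href yield None
            continue
        url = urljoin(base, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

With the code above it could be called as absolute_links(a.get("href") for a in product_links).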

About 36 links end up in the list, one for each of the 36 products on the page. I then take each link from hrefs, visit it, and scrape further data from that page.

products = []
for href in hrefs:
    # Load each product page in the browser and parse the rendered HTML
    driver.get(href)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    title = soup.find("h1", class_="e12cshkt0").text.strip()
    price = soup.find("span", class_="css-1qfcjyj").text.strip()
    image_link = soup.find("img", class_="css-qvzl9f")["src"]
    product = {
        "title": title,
        "price": price,
        "image_link": image_link
    }
    products.append(product)
driver.quit()
print(len(products))

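But this takes far too long. I set the page-load timeout to 900 seconds, and it still times out. The problems:

For now I only extract product links from the first page, but there are more pages, up to 40, with 36 products on each. When I extend the code to fetch data from all pages, it also times out.
In the second part, visiting each of those links and scraping it takes even longer. How can I reduce the execution time of this program? Can I split the program into parts?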
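
On the question of splitting the program into parts, one possible reading is to process the links in batches and save each batch as soon as it finishes, so a timeout does not lose earlier work and a rerun can resume. The following is only a sketch under assumptions: chunked, scrape_in_batches, the batch_NNN.json file naming, and the scrape_one callback (a function that takes a URL and returns a JSON-serialisable dict) are all hypothetical names introduced for illustration.

```python
import json
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def scrape_in_batches(urls, scrape_one, out_dir, batch_size=36):
    """Scrape urls batch by batch, writing each batch to its own JSON file.

    A batch whose output file already exists is skipped, so running the
    function again after a failure continues with the unfinished batches.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, batch in enumerate(chunked(urls, batch_size)):
        out_file = out_dir / f"batch_{index:03d}.json"
        if out_file.exists():
            continue  # this batch was completed in an earlier run
        results = [scrape_one(url) for url in batch]
        out_file.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
```

A batch size of 36 matches one listing page here, but any size works; smaller batches lose less work when a run is interrupted.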

To get the product information, you can request the site's JSON endpoint with requests rather than loading each page in a browser:

import requests
from bs4 import BeautifulSoup

api_url = "https://www.catch.com.au/seller/vdoo/products.json"

params = {
    "page": 1,  # <-- to get other pages, increase this parameter
}

data = requests.get(api_url, params=params).json()

# Build the absolute URL of every product on this page
urls = []
for r in data['payload']['results']:
    urls.append(f"https://www.catch.com.au{r['product']['productPath']}")

# Fetch each product page and read the title and price from its HTML
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    price = soup.select_one('[itemprop=price]')['content']
    title = soup.h1.text
    print(f'{title:<100} {price:<5}')

Prints:

2x Pure Natural Cotton King Size Pillow Case Cover Slip - 54x94cm - White  46.99
Fire Starter Lighter Waterproof Flint Match Metal Keychain Camping Survival - Gold  20.89
Plain Solid Colour Cushion Cover Covers Decorative Pillow Case - Apple Green  20.9
2000TC 4PCS Bed Sheet Set Flat Fitted Pillowcase Single Double Queen King Bed - Black  57.18
All Size Bed Ultra Soft Quilt Duvet Doona Cover Set Bedding - Paris Eiffel Tower  50.99

...and so on.
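
To cover all 40 pages and cut the waiting time further, the page parameter can be looped over and the product pages fetched concurrently. The sketch below is one way this might look, not a tested solution for this site: the helper names (fetch_json, product_urls, collect_urls, fetch_all, main) are hypothetical, the payload layout is taken from the answer's code, and stopping at the first empty page is an assumption about how the endpoint behaves past the last page. A thread pool is used because the work is dominated by waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.catch.com.au"
API_URL = BASE_URL + "/seller/vdoo/products.json"


def fetch_json(page):
    """Fetch one listing page from the JSON endpoint used in the answer."""
    import requests  # imported lazily so the pure helpers work without it

    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def product_urls(data):
    """Build absolute product URLs from one page of the JSON payload."""
    return [BASE_URL + r["product"]["productPath"] for r in data["payload"]["results"]]


def collect_urls(first_page, last_page, fetch=fetch_json):
    """Collect product URLs from first_page to last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        page_urls = product_urls(fetch(page))
        if not page_urls:  # assumption: an empty page means there are no more
            break
        urls.extend(page_urls)
    return urls


def fetch_all(urls, fetch_one, max_workers=8):
    """Apply fetch_one to every URL in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))


def main():
    """Example wiring; needs requests, beautifulsoup4 and network access."""
    import requests
    from bs4 import BeautifulSoup

    def scrape_product(url):
        soup = BeautifulSoup(requests.get(url, timeout=30).content, "html.parser")
        return soup.h1.text.strip(), soup.select_one("[itemprop=price]")["content"]

    for title, price in fetch_all(collect_urls(1, 40), scrape_product):
        print(f"{title:<100} {price:<5}")
```

Calling main() runs the whole pipeline. Keeping max_workers small is a deliberate choice to avoid putting heavy load on the site; the value 8 is arbitrary.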
