I'm running into a problem scraping a multi-page website in Spyder: the site has pages 1 through 6 plus a Next button, and each of the six pages shows 30 results. I've tried two solutions, but neither of them works.

Here is the first one:

#SOLUTION 1#
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python      
soup = BeautifulSoup(driver.page_source, 'lxml')

postings = soup.find_all('div', class_ = 'isp_grid_product')

#Creates data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})

#Scrape the data
for i in range (1,7): #I've also tried with range (1,6), but it gives 5 pages instead of 6.
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num="+str(i)+""
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)

The output of this code is a dataframe with 180 rows (30 x 6), but it repeats the results.
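For reference, the loop in Solution 1 builds url on every pass but never actually loads it, so the same first-page soup is scraped six times. A minimal sketch of the missing step, assuming each page renders the same product grid and just needs a moment for the JavaScript to finish, would be:

import time

for i in range(1, 7):
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=" + str(i)
    driver.get(url)                                    # actually navigate to page i
    time.sleep(2)                                      # crude wait for the JS-rendered grid (assumption)
    soup = BeautifulSoup(driver.page_source, 'lxml')   # re-parse the freshly loaded page
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    # ...same per-post extraction as in the code above...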

Here is the second solution I tried:

### SOLUTION 2 ###

from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python      
soup = BeautifulSoup(driver.page_source, 'lxml')
soup

#Create data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})

#Scrape data
i = 0
while i < 6:
    
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    len(postings)

    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)

    #Imports the next pages HTML into python
    next_page = 'https://store.unionlosangeles.com'+soup.find('div', class_ = 'page-item next').get('href')
    page = requests.get(next_page)
    soup = BeautifulSoup(page.text, 'lxml')
    i += 1

The problem with the second solution is that, for a reason I cannot figure out, the program does not recognize the "get" attribute in the next_page line (I have not had this problem on other paginated sites). As a result I only get the first page and none of the others.
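For what it's worth, that failure usually means soup.find('div', class_ = 'page-item next') returned None, so .get is being called on None; this shop's pagination controls appear to be rendered by JavaScript, so they may simply be absent from the HTML being parsed. A defensive version of that step, with the selectors kept as assumptions, would look like:

next_div = soup.find('div', class_ = 'page-item next')
if next_div is None:
    break                                    # no "next" element in this HTML at all
next_link = next_div.find('a') or next_div   # the href may live on an <a> inside the div (assumption)
href = next_link.get('href')
if href is None:
    break                                    # element found, but it carries no href
next_page = 'https://store.unionlosangeles.com' + href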

How can I fix the code so that it correctly scrapes all 180 items?

Recommended answer

The data you see is loaded from an external URL via JavaScript. You can simulate those calls with the requests module. For example:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1"
api_url = "https://cdn-gae-ssl-premium.akamaized.net/categories_navigation"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

params = {
    "page_num": 1,
    "store_id": "",
    "UUID": "",
    "sort_by": "creation_date",
    "facets_required": "0",
    "callback": "",
    "related_search": "1",
    "category_url": "/collections/outerwear",
}

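# the store_id and UUID the API needs are embedded in the query string of a
# script tag that follows the #isp_search_result_page element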
q = parse_qs(
    urlparse(soup.select_one("#isp_search_result_page ~ script")["src"]).query
)

params["store_id"] = q["store_id"][0]
params["UUID"] = q["UUID"][0]

all_data = []
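# query the API once per page (1-6); each returned item uses short keys:
# 'u' -> product link, 'v' -> vendor, 'l' -> title, 'p' -> price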
for params["page_num"] in range(1, 7):
    data = requests.get(api_url, params=params).json()
    for i in data["items"]:
        link = i["u"]
        vendor = i["v"]
        title = i["l"]
        price = i["p"]

        all_data.append([link, vendor, title, price])

df = pd.DataFrame(all_data, columns=["link", "vendor", "title", "price"])
print(df.head(10).to_markdown(index=False))
print("Total items =", len(df))

Prints:

| link | vendor | title | price |
|:---|:---|:---|---:|
| /products/barn-jacket | Essentials | BARN JACKET | 250 |
| /products/work-vest-2 | Essentials | WORK VEST | 120 |
| /products/tailored-track-jacket | Martine Rose | TAILORED TRACK JACKET | 1206 |
| /products/work-vest-1 | Essentials | WORK VEST | 120 |
| /products/60-40-cloth-bug-anorak-1tone | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
| /products/smooth-jersey-stand-man-woman-track-jkt | Kapital | Smooth Jersey STAND MAN & WOMAN Track JKT | 423 |
| /products/supersized-sports-jacket | Martine Rose | SUPERSIZED SPORTS JACKET | 1695 |
| /products/pullover-vest | Nicholas Daley | PULLOVER VEST | 267 |
| /products/flannel-polkadot-x-bandana-reversible-1st-jkt-1 | Kapital | FLANNEL POLKADOT X BANDANA REVERSIBLE 1ST JKT | 645 |
| /products/60-40-cloth-bug-anorak-1tone-1 | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
Total items = 175
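Note that this approach does not need Selenium at all: the categories_navigation endpoint returns each page's products as JSON, so one plain requests call per page is enough.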
