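I'm trying to scrape the brand, name, and image URL of every product from https://www.mecca.com.au/skin-care/

I'm having trouble scraping the image URL because the image element has no href.

For example, here is the page for one product: https://www.mecca.com.au/tatcha/indigo-cleansing-balm/I-059912.html?cgpath=skincare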

<picture class="css-1azgcry-container" id="image-reponsive-container" data-testid="imageReponsiveContainer">
  <img class="" title="Tatcha - Indigo Cleansing Balm" alt="Tatcha - Indigo   Cleansing Balm" src="https://www.mecca.com.au/on/demandware.static/-/Sites-mecca-online-catalog/default/dw0a0707d8/product/tatcha/hr/i-059912-indigo-cleansing-balm-7-1-940.jpg" data-testid="imageReponsiveImg"></picture>

The image URL I want to extract is just the value of the src attribute, for example here:

src="https://www.mecca.com.au/on/demandware.static/-/Sites-mecca-online-catalog/default/dw0a0707d8/product/tatcha/hr/i-059912-indigo-cleansing-balm-7-1-940.jpg"

Here is my code:

import requests
from bs4 import BeautifulSoup

url = "https://www.mecca.com.au/skin-care/"
params = {"start": "0", "sz": "36", "format": "ajax"}
for params["start"] in range(0, 36 * 5, 36):
    productlinks = []
    cataloguelist = []
    soup = BeautifulSoup(requests.get('https://www.mecca.com.au/skin-care/', params=params).content, "html.parser")
    products = soup.find_all('div', class_="grid-product-info")

    for item in products:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])

    for link in productlinks:
        response = requests.get(link)

        soup = BeautifulSoup(response.content, "html.parser")
        brand = soup.find('a', class_='css-1p371np-size5-size5-sansSerif-sansSerif-brandNameLink').text
        name = soup.find('span', class_='css-1noela6-paragraph-paragraph-sansSerif-sansSerif-productName').text
### This is where I'm struggling, I tried 'picture', class_='css-1azgcry-container' as well
        image = soup.find('picture', attrs={'css-1azgcry-container', 'img', 'alt', 'src'})
        
print(brand, name, image)

Thanks for your help!

Recommended answer

In your code, if you print(image), you will see:

<picture class="css-1azgcry-container" data-testid="imageReponsiveContainer" id="image-reponsive-container"><img alt="MECCA COSMETICA - Renewing Gel Cleanser" class="" data-testid="imageReponsiveImg" src="https://www.mecca.com.au/on/demandware.static/-/Sites-mecca-online-catalog/default/dwf0d61029/product/mecca/hr/i-061159-renewing-gel-cleanser-1-940.jpg" title="MECCA COSMETICA - Renewing Gel Cleanser"/></picture>

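This means you have selected the <picture> element, but the URL lives on the <img> nested inside it, so you need to go one level deeper. The best way is to look the container up by its ID (id="image-reponsive-container"), because an ID is meant to be unique within a page.

Instead of: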

image = soup.find('picture', attrs={'css-1azgcry-container', 'img', 'alt', 'src'})

use:

image = soup.find(id='image-reponsive-container').find('img').get('src')

Or, better yet, use a CSS selector:

image = soup.select_one('#image-reponsive-container > img')['src']
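Both lookups can be tried offline against the markup quoted in the question. The following is a minimal sketch, not the only way to do it: the helper name extract_image_url is hypothetical, and the None guard is an added assumption for product pages where the image container might be missing, since select_one returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Markup copied from the product page quoted in the question.
HTML = '''
<picture class="css-1azgcry-container" id="image-reponsive-container" data-testid="imageReponsiveContainer">
  <img class="" title="Tatcha - Indigo Cleansing Balm" alt="Tatcha - Indigo Cleansing Balm" src="https://www.mecca.com.au/on/demandware.static/-/Sites-mecca-online-catalog/default/dw0a0707d8/product/tatcha/hr/i-059912-indigo-cleansing-balm-7-1-940.jpg" data-testid="imageReponsiveImg"></picture>
'''


def extract_image_url(soup):
    """Hypothetical helper: return the image URL, or None if no <img> matches."""
    # Select the <img> that is a direct child of the container with this ID.
    img = soup.select_one('#image-reponsive-container > img')
    # select_one returns None when nothing matches, so guard before reading src.
    return img.get('src') if img is not None else None


soup = BeautifulSoup(HTML, "html.parser")
image = extract_image_url(soup)

# The chained find() calls from the answer reach the same element.
image_via_find = soup.find(id='image-reponsive-container').find('img').get('src')

# A page without the container yields None instead of raising an exception.
empty = extract_image_url(BeautifulSoup("<p>no image here</p>", "html.parser"))

print(image)
print(image == image_via_find)
print(empty)
```

In the scraping loop, checking for None before using the result lets the script skip a product whose page lacks the image instead of stopping with a TypeError.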
