Python 使用 BeautifulSoup find_previous_sibling 进行刮擦无法按预期工作

发布于08月02日

我正试图从一个网站(https://carone.com.uy/autos-usados-y-0km?p=21)中获取几个价值.有些人在工作，但有些人没有.例如，我可以擦除名称、型号、价格和可燃物，但我无法正确擦除"año"或"kri"字段，代码总是返回"N/A"作为值.

以下是我使用的代码:

import pandas as pd
from datetime import date
import os
import socket
import requests
from bs4 import BeautifulSoup

def scrape_product_data(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
        }

        product_data = []

        # Make the request to get the HTML content
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Check if the request was successful

        soup = BeautifulSoup(response.text, 'html.parser')
        product_elements = soup.find_all('div', class_='product-item-info')
        for product_element in product_elements:
            # Extract product name, price, model, and attributes as before (same code as previous version)
            product_name_element = product_element.select_one('p.carone-car-info-data-brand.cursor-pointer')
            product_name = product_name_element.text.strip() if product_name_element else "N/A"

            product_price_element = product_element.find('span', class_='price')
            product_price = product_price_element.text.strip() if product_price_element else "N/A"

            product_model_element = product_element.select_one('p.carone-car-info-data-model')
            product_model = product_model_element.get('title').strip() if product_model_element else "N/A"

            # Extract product attributes
            attributes_div = product_element.find('div', class_='carone-car-attributes')
            
            year_element = attributes_div.find('p', class_='carone-car-attribute-title', text='Año')
            year_value = year_element.find_previous_sibling('p', class_='carone-car-attribute-value').text if year_element else "N/A"

            kilometers_element = attributes_div.find('p', class_='carone-car-attribute-title', text='Kilómetros')
            kilometers_value = kilometers_element.find_previous_sibling('p', class_='carone-car-attribute-value').text if kilometers_element else "N/A"

            fuel_element = attributes_div.find('p', class_='carone-car-attribute-title', text='Combustible')
            fuel_value = fuel_element.find_previous_sibling('p', class_='carone-car-attribute-value').text if fuel_element else "N/A"

            # Append product data as a tuple (name, price, model, year, kilometers, fuel) to the list
            product_data.append((product_name, product_price, product_model, year_value, kilometers_value, fuel_value))

结果是这样的:enter image description here

我不明白为什么前面提到的值总是得到N/A，而其他值都运行得很好，方法也是一样的.

def scrape_product_data(url): headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3" } product_data = [] response = requests.get(url, headers=headers) response.raise_for_status() soup = BeautifulSoup(response.text, "html.parser") product_elements = soup.find_all("div", class_="product-item-info") for product_element in product_elements: product_name_element = product_element.select_one( "p.carone-car-info-data-brand.cursor-pointer" ) product_name = ( product_name_element.text.strip() if product_name_element else "N/A" ) product_price_element = product_element.find("span", class_="price") product_price = ( product_price_element.text.strip() if product_price_element else "N/A" ) product_model_element = product_element.select_one( "p.carone-car-info-data-model" ) product_model = ( product_model_element.get("title").strip() if product_model_element else "N/A" ) attributes_div = product_element.find("div", class_="carone-car-attributes") year_element = attributes_div.find( "p", class_="carone-car-attribute-title", string="A&ntildeo" ) year_value = ( year_element.find_previous_sibling( "p", class_="carone-car-attribute-value" ).text if year_element else "N/A" ) kilometers_element = attributes_div.find( "p", class_="carone-car-attribute-title", string="Kil&oacutemetros" ) kilometers_value = ( kilometers_element.find_previous_sibling( "p", class_="carone-car-attribute-value" ).text if kilometers_element else "N/A" ) fuel_element = attributes_div.find( "p", class_="carone-car-attribute-title", string="Combustible" ) fuel_value = ( fuel_element.find_previous_sibling( "p", class_="carone-car-attribute-value" ).text if fuel_element else "N/A" ) product_data.append( ( product_name, product_price, product_model, year_value, kilometers_value, fuel_value, ) ) return pd.DataFrame( product_data, columns=["Name", "Price", "Model", "Year", "KM", "Fuel"] ) df = scrape_product_data("https://carone.com.uy/autos-usados-y-0km?p=2") print(df)

Name Price Model Year KM Fuel 0 Renault Kwid US$12.000 KWID 1.0 INTENSE TACTIL 2018 82.390 NAFTA 1 Chevrolet Onix US$20.800 NEW ONIX 1.0T RS MT 2021 46.000 NAFTA 2 Suzuki Swift US$17.800 NUEVO SWIFT 1.2 GL AT 2020 63.641 NAFTA 3 Fiat Toro US$23.800 TORO 1.8 FREEDOM DC MT 2021 15.330 NAFTA 4 Renault Oroch US$26.300 NEW OROCH INTENS OUTSIDER 1.3T DC AT 2023 21.360 NAFTA 5 Renault Stepway US$15.100 STEPWAY PRIVILEGE 1.6 2017 60.010 NAFTA 6 Renault Kwid US$13.100 KWID 1.0 LIFE 2022 14 NAFTA 7 Chevrolet Onix US$22.800 NEW ONIX 1.0T PREMIER AT 2021 14.780 NAFTA 8 Nissan SENTRA B18 US$34.000 SENTRA B18 2.0 EXCLUSIVE AT 2022 30.430 NAFTA 9 Renault Kwid US$13.500 KWID 1.0 INTENSE MT 2020 37.660 NAFTA 10 Chevrolet Tracker US$16.300 TRACKER 1.8 LTZ 4X4 AT 2014 91.689 NAFTA 11 Chevrolet Onix US$18.600 NEW ONIX PLUS 1.2 LS 4P MT 2022 24.658 NAFTA

Python 使用 BeautifulSoup find_previous_sibling 进行刮擦无法按预期工作

推荐答案

Python相关问答推荐

根据不同列的值在收件箱中移动数据

如何使用pandasDataFrames和scipy高度优化相关性计算

如何使用matplotlib在Python中使用规范化数据和原始t测试值创建组合热图？

如何根据参数推断对象的返回类型？

. str.替换pandas.series的方法未按预期工作

使可滚动框架在tkinter环境中看起来自然

处理带有间隙(空)的duckDB上的重复副本并有效填充它们

根据二元组列表在pandas中创建新列

如何在solve()之后获得症状上的等式的值

如何使用Python以编程方式判断和检索Angular网站的动态内容？

Stacked bar chart from billrame

Python逻辑操作作为Pandas中的条件

无法连接到Keycloat服务器

Django—cte给出：QuerySet对象没有属性with_cte''''

合并与拼接并举

根据客户端是否正在传输响应来更改基于Flask的API的行为

为用户输入的整数查找根/幂整数对的Python练习

应用指定的规则构建数组

在matplotlib中重叠极 map 以创建径向龙卷风图

Python：在cmd中添加参数时的语法