As the title says, I am trying to run my Scrapy spider, and the problem is that it only seems to return results for the initial URL (https://www.antaira.com/products/10-100Mbps).

I am not sure where my program is going wrong. I have also left some commented-out code in place to show what I tried.

import scrapy
from ..items import AntairaItem


class ProductJumperFix(scrapy.Spider):  # classes should be TitleCase

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps',
        'https://www.antaira.com/products/unmanaged-gigabit'
        'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'
        'https://www.antaira.com/products/Unmanaged-Gigabit-PoE'
        'https://www.antaira.com/products/Unmanaged-10-gigabit'
        'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE'
    ]
    
    #def start_requests(self):
    #    yield scrappy.Request(start_urls, self.parse)

    def parse(self, response):
        # iterate through each of the relative urls
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items

Thanks, everyone!

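A follow-up question: for some reason, when I run "scrapy crawl productJumperFix" I get no output in the terminal. I am not sure how to debug this, because I cannot even see any error output.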

Recommended answer

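Try using the start_requests method. Note that in the question's start_urls list only the first URL is followed by a comma. Python joins adjacent string literals, so the remaining five URLs become a single long string that is not one of the intended category URLs. That is why only the first URL produces results. The version below separates every URL with a comma.

For example: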

import scrapy
from ..items import AntairaItem

class ProductJumperFix(scrapy.Spider):

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']

    def start_requests(self):
        urls = [
            'https://www.antaira.com/products/10-100Mbps',
            'https://www.antaira.com/products/unmanaged-gigabit',
            'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE',
            'https://www.antaira.com/products/Unmanaged-Gigabit-PoE',
            'https://www.antaira.com/products/Unmanaged-10-gigabit',
            'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem()  # a fresh item for each product
            items['product_link'] = response.url
            # default='' avoids calling .strip() on None if the h1 is missing
            name = product.css('h1.product-name::text').get(default='').strip()
            features = product.css('section.features h3 + ul').getall()
            overview = product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            # no trailing commas: `x = value,` would store a one-element tuple
            items['name'] = name
            items['features'] = features
            items['overview'] = overview
            items['main_image'] = main_image
            items['rel_links'] = rel_links
            yield items
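
To see why a list of URLs written without commas behaves this way, here is a minimal sketch in plain Python (no Scrapy needed). The helper find_fused_urls is a hypothetical name made up for this illustration and is not part of Scrapy; it simply flags list entries that contain more than one "https://".

```python
# Python joins adjacent string literals at compile time, so a list written
# without commas silently has fewer elements than it appears to.
broken_urls = [
    'https://www.antaira.com/products/10-100Mbps',
    'https://www.antaira.com/products/unmanaged-gigabit'          # no comma
    'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'   # no comma
    'https://www.antaira.com/products/Unmanaged-Gigabit-PoE',
]


def find_fused_urls(urls):
    """Return entries that contain more than one 'https://' scheme.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    return [url for url in urls if url.count('https://') > 1]


# Four lines of URLs, but only two list elements.
print(len(broken_urls))

# The second element is three URLs fused into one string.
for fused in find_fused_urls(broken_urls):
    print(fused)
```

A check like this could be run once against a spider's URL list before crawling, so that a missing comma shows up immediately instead of as a silently shorter crawl.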
