As the title states, I am trying to run my Scrapy spider, but it only seems to return items from the initial URL (https://www.antaira.com/products/10-100Mbps).

I am not sure where my program goes wrong. I have left some commented-out code in place to show what I have already attempted.

import scrapy
from ..items import AntairaItem


class ProductJumperFix(scrapy.Spider):  # classes should be TitleCase

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps',
        'https://www.antaira.com/products/unmanaged-gigabit'
        'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'
        'https://www.antaira.com/products/Unmanaged-Gigabit-PoE'
        'https://www.antaira.com/products/Unmanaged-10-gigabit'
        'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE'
    ]
    
    #def start_requests(self):
    #    yield scrappy.Request(start_urls, self.parse)

    def parse(self, response):
        # iterate through each of the relative urls
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items

Thank you everyone!

Follow-up question: for some reason, when I run "scrapy crawl productJumperFix" I am not getting any output in the terminal. I am not sure how to debug this, since I can't even see any error output.

Recommended answer

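The start_urls list in the question has a comma after the first URL only. Python joins adjacent string literals into one string, so the list actually holds two entries: the first URL and one long, invalid URL made of the other five glued together. That is most likely why only the first category page produces results. Either add the missing commas, or build the requests yourself with the start_requests method.

For example: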

import scrapy
from ..items import AntairaItem

class ProductJumperFix(scrapy.Spider):

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']

    def start_requests(self):
        urls = [
            'https://www.antaira.com/products/10-100Mbps',
            'https://www.antaira.com/products/unmanaged-gigabit',
            'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE',
            'https://www.antaira.com/products/Unmanaged-Gigabit-PoE',
            'https://www.antaira.com/products/Unmanaged-10-gigabit',
            'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem()  # unique item for each iteration
            items['product_link'] = response.url
            # get(default='') avoids calling strip() on None when the name is missing
            name = product.css('h1.product-name::text').get(default='').strip()
            features = product.css('section.features h3 + ul').getall()
            overview = product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            # No trailing commas: "x = value," would store a one-element tuple
            items['name'] = name
            items['features'] = features
            items['overview'] = overview
            items['main_image'] = main_image
            items['rel_links'] = rel_links
            yield items
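
Two plain-Python details in the question's spider are easy to miss, so here is a minimal sketch that reproduces them without Scrapy. The shortened URL list and the 'Example product' value are only illustrative, and a plain dict stands in for AntairaItem; this is a demonstration of the language behaviour, not of what the site returns.

```python
# Pitfall 1: a missing comma between string literals.
# Python joins adjacent string literals, so this list has 2 entries, not 3.
start_urls = [
    'https://www.antaira.com/products/10-100Mbps',
    'https://www.antaira.com/products/unmanaged-gigabit'  # <- no comma here
    'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'
]
print(len(start_urls))  # 2
print(start_urls[1])    # the last two URLs glued into one invalid URL

# Pitfall 2: a trailing comma after an assignment.
# "x = value," builds a one-element tuple instead of storing the value itself.
name = 'Example product'  # hypothetical value, for illustration only
item_with_comma = {}      # a plain dict standing in for AntairaItem
item_with_comma['name'] = name,
print(type(item_with_comma['name']).__name__)  # tuple

item_without_comma = {}
item_without_comma['name'] = name
print(type(item_without_comma['name']).__name__)  # str
```

If the exported items show values wrapped in an extra list or parentheses, the trailing commas on the items[...] assignments are the likely cause.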
