I'm learning Scrapy. For example, there is the site http://quotes.toscrape.com. I'm creating a simple spider (`scrapy genspider quotes`). I want to parse the quotes, and also follow each author's page and parse his date of birth. I tried to do it like this, but it doesn't work.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        
        quotes=response.xpath('//div[@class="quote"]') 
        
        item={}

        for quote in quotes: 
            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            response.follow(url, self.parse_additional_page, item) 
            

        new_page=response.xpath('//li[@class="next"]/a/@href').get() 

        if new_page is not None: 

            yield response.follow(new_page,self.parse) 
            
    def parse_additional_page(self, response, item): 
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get() 
        yield item
            

Code without the birth date (this one works):

import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'name': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//small[@class="author"]/text()').get(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            }
        new_page = response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page, self.parse)

Q: How do I visit the author page for each quote and parse the date of birth?


Recommended Answer

You were actually really close to getting it right. There are just a few things you missed and one thing that needs to be moved.

  1. `response.follow` returns a Request object, so unless you yield that request, it will never be scheduled by the Scrapy engine.

  2. When passing objects from one callback to another, you should use the `cb_kwargs` parameter. Using the `meta` dict also works, but Scrapy officially prefers the `cb_kwargs` dict. Simply passing the object as a positional argument, however, does not work.

  3. dicts are mutable, including when they are used as Scrapy items. So each item should be a unique object when you create it; otherwise, updating the item later can mutate items that were already yielded.
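The third point can be shown without Scrapy at all. A minimal pure-Python sketch of how reusing one dict mutates items that were already handed out:

```python
# Reusing a single dict: every "yielded" item is the same object,
# so a later update rewrites all of them.
shared = {}
items = []
for name in ["quote one", "quote two"]:
    shared['name'] = name
    items.append(shared)          # every entry references the same dict
print(items[0]['name'])           # 'quote two' -- the first item was mutated

# Creating a fresh dict inside the loop keeps each item independent.
fresh_items = []
for name in ["quote one", "quote two"]:
    item = {'name': name}
    fresh_items.append(item)
print(fresh_items[0]['name'])     # 'quote one'
```

This is exactly why the item constructor has to move inside the loop in the corrected spider.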

The example below uses your code but implements the three points I mentioned above.

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            # moving the item constructor inside the loop 
            # means it will be unique for each item
            item={}   

            item['name']=quote.xpath('.//span[@class="text"]/text()').get()
            item['author']=quote.xpath('.//small[@class="author"]/text()').get()
            item['tags']=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').getall()
            url=quote.xpath('.//small[@class="author"]/../a/@href').get()
            # you have to yield the request returned by response.follow
            yield response.follow(url, self.parse_additional_page, cb_kwargs={"item": item})
        new_page=response.xpath('//li[@class="next"]/a/@href').get()
        if new_page is not None:
            yield response.follow(new_page)

    def parse_additional_page(self, response, item=None):
        item['additional_data'] = response.xpath('//span[@class="author-born-date"]/text()').get()
        yield item

Partial output:

2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/Martin-Luther-King-Jr/>
{'name': '“Only in the darkness can you see the stars.”', 'author': 'Martin Luther King Jr.', 'tags': ['hope', 'inspirational'], 'additional_data': 'January 15, 1929'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/C-S-Lewis/>
{'name': '“You can never get a cup of tea large enough or a book long enough to suit me.”', 'author': 'C.S. Lewis', 'tags': ['books', 'inspirational', 'reading', 'tea'], 'additional_data': 'November 29, 1898'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/George-R-R-Martin/>
{'name': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'author': 'George R.R. Martin', 'tags': ['read', 'readers', 'reading', 'reading-books'], 'additional_data': '\nSeptember 20, 1948'}
2023-05-10 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/author/James-Baldwin/>
{'name': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'author': 'James Baldwin', 'tags': ['love'], 'additional_data': 'August 02, 1924'}
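The `cb_kwargs` dict used above is simply unpacked into keyword arguments when Scrapy invokes the callback, roughly `callback(response, **request.cb_kwargs)`. A framework-free sketch of that hand-off (the plain dict standing in for a real `Response` is purely illustrative):

```python
# Stand-in for the spider callback: it receives the item as a keyword argument.
def parse_additional_page(response, item=None):
    item['additional_data'] = response['born']
    return item

# Scrapy stores cb_kwargs on the Request and later unpacks it into the call.
cb_kwargs = {"item": {"name": "a quote", "author": "Some Author"}}
fake_response = {"born": "January 15, 1929"}   # illustrative stand-in, not a real Response
result = parse_additional_page(fake_response, **cb_kwargs)
print(result['additional_data'])               # January 15, 1929
```

This is also why the callback signature needs `item=None` (or `item` as an explicit keyword parameter): passing the item positionally, as in the original attempt, never reaches the callback.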

Check out `cb_kwargs` and `response.follow` in the Scrapy documentation for more information.
