Python Scrapy (2023)

Posted on November 20

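We want to scrape articles (content + title) to extend our dataset for text classification.

Goal: scrape all articles from all pages of https://www.bbc.com/news/technology

Problem: the code seems to extract articles only from https://www.bbc.com/news/technology?page=1, even though we follow every page. Is there something wrong with the way we follow the pages?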

from typing import Any

import scrapy
from scrapy.http import Response


class BBCSpider_2(scrapy.Spider):

    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        # The last entry of the pagination list holds the highest page number.
        max_pages = response.xpath("//nav[@aria-label='Page']/div/div/div/div/ol/li[last()]/div/a//text()").get()
        max_pages = int(max_pages)
        # Request every listing page, 1..max_pages inclusive.
        for p in range(max_pages):
            page = f"https://www.bbc.com/news/technology?page={p+1}"
            yield response.follow(page, callback=self.parse_articles2)
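
A side note on the `int(max_pages)` step: `.get()` returns `None` when the XPath matches nothing, and `int(None)` raises a `TypeError`. The following is only a minimal sketch of a more defensive conversion; `parse_max_pages` and its `default` fallback are hypothetical names that are not part of the original spider.

```python
from typing import Optional


def parse_max_pages(raw: Optional[str], default: int = 1) -> int:
    """Turn the pagination label into a page count, falling back to `default`."""
    if raw is None:
        return default  # the XPath matched nothing
    digits = raw.strip()
    # Anything that is not a plain number also falls back to the default.
    return int(digits) if digits.isdigit() else default


print(parse_max_pages("29"))  # 29
print(parse_max_pages(None))  # 1
```

In the spider this would wrap the value returned by `.get()` before the `for` loop.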

Next, we follow the link to each article on the corresponding page:

    def parse_articles2(self, response):
        # Article links live in two containers of #main-content: div[4] and div[8].
        container_to_scan = [4, 8]
        for box in container_to_scan:
            if box == 4:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div/div/ul/li")
            elif box == 8:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div[2]/ol/li")
            for article_idx in range(len(articles)):
                # XPath positions are 1-based, hence article_idx + 1.
                if box == 4:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[4]/div/div/ul/li[{article_idx+1}]/div/div/div/div[1]/div[1]/a/@href").get()
                elif box == 8:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[8]/div[2]/ol/li[{article_idx+1}]/div/div/div[1]/div[1]/a/@href").get()
                else:
                    relative_url = None

                if relative_url is not None:
                    followup_url = "https://www.bbc.com" + relative_url
                    yield response.follow(followup_url, callback=self.parse_article)

Last but not least, we collect the content and title of each article:

    def parse_article(self, response):
        article_text = response.xpath("//article/div[@data-component='text-block']")
        content = []
        for box in article_text:
            # .get() returns only the first matching text node of each block.
            text = box.css("div p::text").get()
            if text is not None:
                content.append(text)

        title = response.css("h1::text").get()

        yield {
            "title": title,
            "content": content,
        }

When we run this, we get an item_scraped_count of 24. But it should be 24 x 29 +/- ...
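
One possible explanation for the count of 24, stated here as an assumption rather than a confirmed diagnosis: if every `?page=N` URL returns the same HTML, each listing page exposes the same 24 article links, and Scrapy's default duplicate filter schedules each URL only once. The sketch below imitates that effect in plain Python. The article URLs are hypothetical, and the 29 pages and 24 links per page come from the figures in the question.

```python
def count_scheduled(pages: int, links_per_page: int) -> int:
    """Count unique article URLs when every listing page exposes the same links."""
    seen: set[str] = set()  # stands in for Scrapy's duplicate-request filter
    for _page in range(1, pages + 1):
        # Assumption: the HTML is identical for every ?page=N, so links repeat.
        for idx in range(links_per_page):
            seen.add(f"https://www.bbc.com/news/hypothetical-article-{idx}")
    return len(seen)


print(count_scheduled(pages=29, links_per_page=24))  # 24
```

Under that assumption, following more listing pages cannot raise the item count, because no new article URLs ever reach the scheduler.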

Recommended answer

import scrapy


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response):
        max_pages = response.xpath("//nav[@aria-label='Page']//ol/li[last()]//text()").get()
        # Page 1: the article links are present in the HTML itself.
        for article in response.xpath("//div[@type='article']"):
            if link := article.xpath(".//a[contains(@class, 'LinkPostLink')]/@href").get():
                yield response.follow(link, callback=self.parse_article)
        # Remaining pages: request the JSON endpoint directly.
        # range() excludes its end value, so add 1 to include the last page.
        for i in range(2, int(max_pages) + 1):
            yield scrapy.Request(f"https://www.bbc.com/wc-data/container/topic-stream?adSlotType=mpu_middle&enableDotcomAds=true&isUk=false&lazyLoadImages=true&pageNumber={i}&pageSize=24&promoAttributionsToSuppress=%5B%22%2Fnews%22%2C%22%2Fnews%2Ffront_page%22%5D&showPagination=true&title=Latest%20News&tracking=%7B%22groupName%22%3A%22Latest%20News%22%2C%22groupType%22%3A%22topic%20stream%22%2C%22groupResourceId%22%3A%22urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252%22%2C%22groupPosition%22%3A5%2C%22topicId%22%3A%22cd1qez2v2j2t%22%7D&urn=urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252", callback=self.parse_json)

    def parse_json(self, response):
        # Each post in the JSON payload carries the URL of one article.
        for post in response.json()["posts"]:
            yield scrapy.Request(response.urljoin(post["url"]), callback=self.parse_article)

    def parse_article(self, response):
        # Collect every text node of every text block and join them into one string.
        article_text = response.xpath("//article/div[@data-component='text-block']//text()").getall()
        content = " ".join([i.strip() for i in article_text])
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
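
The hand-encoded query string in the answer's request URL is hard to read and easy to break when edited. As a sketch only, the same URL can be assembled from a dictionary with the standard library. The parameter names and values are copied from the answer, while `topic_stream_url`, `BASE` and `URN` are hypothetical names introduced for illustration. Whether the endpoint needs every one of these parameters is not something the source states.

```python
import json
from urllib.parse import quote, urlencode

BASE = "https://www.bbc.com/wc-data/container/topic-stream"
URN = "urn:bbc:vivo:curation:b2790c4d-d5c4-489a-84dc-be0dcd3f5252"


def topic_stream_url(page_number: int, page_size: int = 24) -> str:
    """Build the topic-stream URL for one page (values copied from the answer)."""
    compact = {"separators": (",", ":")}  # no spaces inside the JSON values
    params = {
        "adSlotType": "mpu_middle",
        "enableDotcomAds": "true",
        "isUk": "false",
        "lazyLoadImages": "true",
        "pageNumber": page_number,
        "pageSize": page_size,
        "promoAttributionsToSuppress": json.dumps(["/news", "/news/front_page"], **compact),
        "showPagination": "true",
        "title": "Latest News",
        "tracking": json.dumps(
            {
                "groupName": "Latest News",
                "groupType": "topic stream",
                "groupResourceId": URN,
                "groupPosition": 5,
                "topicId": "cd1qez2v2j2t",
            },
            **compact,
        ),
        "urn": URN,
    }
    # quote (rather than the default quote_plus) encodes spaces as %20.
    return f"{BASE}?{urlencode(params, quote_via=quote)}"


print(topic_stream_url(2))
```

Inside the spider, the loop over page numbers could then yield `scrapy.Request(topic_stream_url(i), callback=self.parse_json)`.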
