We want to scrape articles (content + title) to extend our dataset for a text-classification task.

Goal: scrape all articles from every page of https://www.bbc.com/news/technology.

Problem: the code only seems to extract articles from https://www.bbc.com/news/technology?page=1, even though we follow all of the pages. Is there something wrong with how we follow the pages?
from typing import Any

import scrapy
from scrapy.http import Response


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        # Read the last page number from the pagination bar.
        max_pages = response.xpath("//nav[@aria-label='Page']/div/div/div/div/ol/li[last()]/div/a//text()").get()
        max_pages = int(max_pages)
        for p in range(max_pages):
            page = f"https://www.bbc.com/news/technology?page={p+1}"
            yield response.follow(page, callback=self.parse_articles2)
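The pagination loop above builds one URL per page and hands each to `response.follow`. As a standalone sketch (the helper name `page_urls` is ours, not part of the spider), it is easy to verify that the generated URLs really are distinct, so Scrapy's duplicate filter should not be dropping them:

```python
def page_urls(base_url: str, max_pages: int) -> list[str]:
    # Build one URL per page, 1-indexed like the loop in parse().
    return [f"{base_url}?page={p + 1}" for p in range(max_pages)]

urls = page_urls("https://www.bbc.com/news/technology", 3)
# All three URLs differ only in the ?page= query parameter.
```

If every URL is unique but only one page's articles come back, the duplication is more likely on the server side (e.g. each `?page=N` responding with the same content) than in the loop itself.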
Next, we drill down into each article on the corresponding page:
    def parse_articles2(self, response):
        # The article lists live in two different containers on the page.
        container_to_scan = [4, 8]
        for box in container_to_scan:
            if box == 4:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div/div/ul/li")
            if box == 8:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div[2]/ol/li")
            for article_idx in range(len(articles)):
                if box == 4:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[4]/div/div/ul/li[{article_idx+1}]/div/div/div/div[1]/div[1]/a/@href").get()
                elif box == 8:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[8]/div[2]/ol/li[{article_idx+1}]/div/div/div[1]/div[1]/a/@href").get()
                else:
                    relative_url = None
                if relative_url is not None:
                    followup_url = "https://www.bbc.com" + relative_url
                    yield response.follow(followup_url, callback=self.parse_article)
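Concatenating the host onto `relative_url` works only while `@href` is genuinely relative; `urllib.parse.urljoin` (or `response.urljoin` in Scrapy) handles both relative and already-absolute hrefs. A small sketch with a hypothetical article path:

```python
from urllib.parse import urljoin

base = "https://www.bbc.com"
# Relative href: joined onto the host.
relative = urljoin(base, "/news/articles/abc123")
# Already-absolute href: returned unchanged instead of being doubled up.
absolute = urljoin(base, "https://www.bbc.com/news/articles/abc123")
```

Plain string concatenation would produce a malformed URL in the absolute case, which `response.follow` would then fail to fetch.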
Last but not least, we collect each article's content and title:
    def parse_article(self, response):
        article_text = response.xpath("//article/div[@data-component='text-block']")
        content = []
        for box in article_text:
            # .get() returns only the first <p> text node per text block.
            text = box.css("div p::text").get()
            if text is not None:
                content.append(text)
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
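One side note on `parse_article`: `box.css("div p::text").get()` returns only the first matching text node per block, while `.getall()` would return all of them, so articles may come out shorter than expected. For a text-classification dataset it can also help to flatten the `content` list into a single string; a minimal sketch with a hypothetical helper `flatten_content` and made-up paragraph texts:

```python
def flatten_content(paragraphs: list[str]) -> str:
    # Join paragraph texts into one training-ready string, skipping empties.
    return " ".join(p.strip() for p in paragraphs if p.strip())

# Paragraphs as parse_article would collect them (illustrative values).
paragraphs = [
    "First paragraph of the article.",
    "  Second paragraph of the article.  ",
    "",
]
body = flatten_content(paragraphs)
```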
When we run this, we get an item_scraped_count of 24, but it should be roughly 24 × 29, give or take.