I am building a script that retrieves all of the review titles from an airline review website. I am using 5 different URLs because I want to compare the titles across 5 different airlines. However, my code only lists the review titles for the last URL in the list, which is for Alaska Airlines. I originally created a single list holding all of the URLs, but it had exactly the same problem and only showed the results for Alaska Airlines.

My code:

# Insert the following command into the command prompt before starting for faster run time:

# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

#Importing and installing necessary packages
!pip install lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pprint import pprint

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.find('div', id='container')

# Retrieving all reviews
titles = results.find_all('h2', class_='text_header')

for title in titles:
    print(title, end="\n"*2)

My output:

<h2 class="text_header">"first class customer service"</h2>

<h2 class="text_header">"deeply unsatisfactory"</h2>

<h2 class="text_header">"Everything was just fabulous"</h2>

<h2 class="text_header">"Messed up airline"</h2>

<h2 class="text_header">"agents who obviously care so much" </h2>

<h2 class="text_header">“communication was sorely lacking”</h2>

<h2 class="text_header">"Never encountered ruder gate workers"</h2>

<h2 class="text_header">"Our check-in bag was badly damaged"</h2>

<h2 class="text_header">"never book again with Alaska Airlines"</h2>

<h2 class="text_header">"I could not get on the plane"</h2>

<h2 class="text_header">The Worlds Best Airlines</h2>

<h2 class="text_header">THE NICEST AIRPORT STAFF</h2>

<h2 class="text_header">THE CLEANEST AIRLINE</h2>

<h2 class="text_header">Alaska Airlines Photos</h2>

I would like to get this output, but for all 5 URLs. How can I retrieve the review titles from all of the URLs?

Accepted Answer

You are overwriting your results on every loop iteration. Either store the results in a list so you can iterate over them in a second for-loop, or scrape the information you need directly inside the loop. Also note that you are only getting the reviews from each airline's first review page; to get all reviews you will have to add another loop that iterates over every page of each airline (take a look at the example first to get the idea).
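For the first option, a minimal sketch (reusing the URLs from your question) would store each parsed page in a list and extract the titles in a second loop; the main example below uses the second, more direct approach:

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.airlinequality.com/airline-reviews/'
endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

# first loop: keep one parsed page per airline instead of overwriting `results`
pages = []
for ending in endings:
    r = requests.get(base_url + ending)
    pages.append((ending, BeautifulSoup(r.content, 'html.parser')))

# second loop: extract the review titles from every stored page
for ending, soup in pages:
    for title in soup.find_all('h2', class_='text_header'):
        print(ending, title.get_text(strip=True))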

The example focuses on the first page, as described in your question, and stores the results in a list of dictionaries that you can simply convert into a DataFrame:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

data = []

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    for e in soup.select('article[itemprop="review"]'):
        data.append({
            'title': e.h2.text,
            'airline': ending,
            'rating':e.select_one('span[itemprop="ratingValue"]').text
        })

data looks like:

[{'title': '"nothing but a headache"',
  'airline': 'american-airlines',
  'rating': '1'},
 {'title': '“provide vegan options”',
  'airline': 'american-airlines',
  'rating': '2'},
 {'title': '“created so much stress and hassle”',
  'airline': 'american-airlines',
  'rating': '1'},...]

Converting data to a DataFrame:

pd.DataFrame(data)
title airline rating
0 "nothing but a headache" american-airlines 1
1 “provide vegan options” american-airlines 2
2 “created so much stress and hassle” american-airlines 1
3 "Terrible from start to finish" american-airlines 1
4 "my bags are stuck in Charlotte" american-airlines 1
...
45 “communication was sorely lacking” alaska-airlines 1
46 "Never encountered ruder gate workers" alaska-airlines 3
47 "Our check-in bag was badly damaged" alaska-airlines 1
48 "never book again with Alaska Airlines" alaska-airlines 4
49 "I could not get on the plane" alaska-airlines 1

How to get "all results"

To give you an idea of how to get all the results, take a look at what the additional while-loop is doing - and remember to be gentle to the website you are scraping and add some delay:

for ending in endings:
    url = f'https://www.airlinequality.com/airline-reviews/{ending}/page/1/?sortby=post_date%3ADesc&pagesize=100'
    while True:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        for e in soup.select('article[itemprop="review"]'):
            data.append({
                'title': e.h2.text,
                'airline': ending,
                'rating':e.select_one('span[itemprop="ratingValue"]').text
            })
        if soup.select_one('article.comp_reviews-pagination ul li:last-of-type a'):
            url = base_url + soup.select_one('article.comp_reviews-pagination ul li:last-of-type a').get('href')
        else:
            break
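As a sketch of the "add some delay" advice, a small helper can pause before every request; polite_get and the one-second default are illustrative choices, not part of the site or the code above:

import time
import requests

def polite_get(url, delay=1.0):
    # illustrative helper: sleep for a fixed delay before each request
    time.sleep(delay)
    return requests.get(url)

Replacing requests.get(url) inside the while-loop with polite_get(url) keeps the scraping logic unchanged while spacing out the requests.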
