I am building a script that retrieves all of the review titles from an airline review website. I am using 5 different URLs because I want to compare the titles across 5 different airlines. However, my code only lists the review titles for the last URL in the list, which is for Alaska Airlines. I originally created a single list holding all of the URLs, but it had exactly the same problem and only showed the results for Alaska Airlines.

My code:

# Insert the following command into the command prompt before starting for faster run time:

# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

#Importing and installing necessary packages
!pip install lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pprint import pprint

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.find('div', id='container')

# Retrieving all reviews
titles = results.find_all('h2', class_='text_header')

for title in titles:
    print(title, end="\n"*2)

My output:

<h2 class="text_header">"first class customer service"</h2>

<h2 class="text_header">"deeply unsatisfactory"</h2>

<h2 class="text_header">"Everything was just fabulous"</h2>

<h2 class="text_header">"Messed up airline"</h2>

<h2 class="text_header">"agents who obviously care so much" </h2>

<h2 class="text_header">“communication was sorely lacking”</h2>

<h2 class="text_header">"Never encountered ruder gate workers"</h2>

<h2 class="text_header">"Our check-in bag was badly damaged"</h2>

<h2 class="text_header">"never book again with Alaska Airlines"</h2>

<h2 class="text_header">"I could not get on the plane"</h2>

<h2 class="text_header">The Worlds Best Airlines</h2>

<h2 class="text_header">THE NICEST AIRPORT STAFF</h2>

<h2 class="text_header">THE CLEANEST AIRLINE</h2>

<h2 class="text_header">Alaska Airlines Photos</h2>

I would like to get this output, but for all 5 URLs. How can I retrieve the review titles from all of the URLs?

Accepted Answer

You are overwriting your results on every loop iteration. Either store the results in a list so you can iterate over them in a second for-loop, or scrape the information you need directly inside the loop. Also note that you are only getting the reviews from each airline's first review page; to get all reviews you will have to add another loop that iterates over every page of each airline (take a look at the example first to get the idea).
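For the first option, a minimal sketch (reusing the URLs from your question) would store each parsed page in a list and extract the titles in a second loop; the main example below uses the second, more direct approach:

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.airlinequality.com/airline-reviews/'
endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

# first loop: keep one parsed page per airline instead of overwriting `results`
pages = []
for ending in endings:
    r = requests.get(base_url + ending)
    pages.append((ending, BeautifulSoup(r.content, 'html.parser')))

# second loop: extract the review titles from every stored page
for ending, soup in pages:
    for title in soup.find_all('h2', class_='text_header'):
        print(ending, title.get_text(strip=True))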

The example focuses on the first page, as described in your question, and stores the results in a list of dictionaries that you can simply convert into a DataFrame:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

data = []

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    for e in soup.select('article[itemprop="review"]'):
        data.append({
            'title': e.h2.text,
            'airline': ending,
            'rating':e.select_one('span[itemprop="ratingValue"]').text
        })

data looks like:

[{'title': '"nothing but a headache"',
  'airline': 'american-airlines',
  'rating': '1'},
 {'title': '“provide vegan options”',
  'airline': 'american-airlines',
  'rating': '2'},
 {'title': '“created so much stress and hassle”',
  'airline': 'american-airlines',
  'rating': '1'},...]

Converting data to a DataFrame:

pd.DataFrame(data)
title airline rating
0 "nothing but a headache" american-airlines 1
1 “provide vegan options” american-airlines 2
2 “created so much stress and hassle” american-airlines 1
3 "Terrible from start to finish" american-airlines 1
4 "my bags are stuck in Charlotte" american-airlines 1
...
45 “communication was sorely lacking” alaska-airlines 1
46 "Never encountered ruder gate workers" alaska-airlines 3
47 "Our check-in bag was badly damaged" alaska-airlines 1
48 "never book again with Alaska Airlines" alaska-airlines 4
49 "I could not get on the plane" alaska-airlines 1

How to get "all results"

To give you an idea of how to get all the results, take a look at what the additional while-loop is doing - and remember to be gentle to the website you are scraping and add some delay:

for ending in endings:
    url = f'https://www.airlinequality.com/airline-reviews/{ending}/page/1/?sortby=post_date%3ADesc&pagesize=100'
    while True:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        for e in soup.select('article[itemprop="review"]'):
            data.append({
                'title': e.h2.text,
                'airline': ending,
                'rating':e.select_one('span[itemprop="ratingValue"]').text
            })
        if soup.select_one('article.comp_reviews-pagination ul li:last-of-type a'):
            url = base_url + soup.select_one('article.comp_reviews-pagination ul li:last-of-type a').get('href')
        else:
            break
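As a sketch of the "add some delay" advice, a small helper can pause before every request; polite_get and the one-second default are illustrative choices, not part of the site or the code above:

import time
import requests

def polite_get(url, delay=1.0):
    # illustrative helper: sleep for a fixed delay before each request
    time.sleep(delay)
    return requests.get(url)

Replacing requests.get(url) inside the while-loop with polite_get(url) keeps the scraping logic unchanged while spacing out the requests.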
