I'm trying to loop through multiple pages of a website (two pages in this example), scrape the relevant customer review data, and finally combine it all into a single data frame.

The challenge I'm running into is that my code seems to produce two separate data frames inside a single data frame object (`df` in the attached code). I may be mistaken, but that's how I'm interpreting it.

Here is a screenshot of what I described above:

Separate data frames within single data frame object

And here is the code that produces the results in the screenshot:

from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([ sub['id'] for sub in reviews ])
    filtered = pd.Series([ sub['filtered'] for sub in reviews ])
    pending = pd.Series([ sub['pending'] for sub in reviews ])
    rating = pd.Series([ sub['rating'] for sub in reviews ])
    title = pd.Series([ sub['title'] for sub in reviews ])
    likes = pd.Series([ sub['likes'] for sub in reviews ])
    experienced = pd.Series([ sub['dates']['experiencedDate'] for sub in reviews ])
    published = pd.Series([ sub['dates']['publishedDate'] for sub in reviews ])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                   'title': title, 'likes': likes, 'experienced': experienced,
                   'published': published, 'source': source})  
    print(df)

I have been leaning on these posts as potential solutions, without any luck:

Rbind, having data frames within data frames causes errors?

Analyse data frames inside a list of data frames and store all results in single data frame

Merge multiple data frames into a single data frame in python

Specifically, I keep getting the following error:

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

Surely the `'class 'str''` bit is a clue to what the problem is, but I have been spinning my wheels and feel I need to "put my pencil down" and ask for help. I'm relatively new to Python, and my gut tells me I need to fix something upstream in the code to avoid this problem in the first place. In other words, while there may be a way to merge these two data frames into one, I feel the root of the problem is surfacing earlier and needs to be addressed there.
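For reference, that error can be reproduced in isolation: `pd.concat` raises it whenever any item in the list it receives is not a `Series` or `DataFrame`. The URL string below is just a stand-in for whatever string ends up in the list:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2]})

try:
    # Passing a plain string alongside a DataFrame triggers the same TypeError
    pd.concat([df, "https://www.trustpilot.com/review/trupanion.com?page=1"])
except TypeError as err:
    print(err)
```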

Accepted answer

Here is an example of how you can collect the data frames from the individual pages and, as a final step, concatenate them into one final data frame:

import json

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

all_dfs = []
for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([sub["id"] for sub in reviews])
    filtered = pd.Series([sub["filtered"] for sub in reviews])
    pending = pd.Series([sub["pending"] for sub in reviews])
    rating = pd.Series([sub["rating"] for sub in reviews])
    title = pd.Series([sub["title"] for sub in reviews])
    likes = pd.Series([sub["likes"] for sub in reviews])
    experienced = pd.Series([sub["dates"]["experiencedDate"] for sub in reviews])
    published = pd.Series([sub["dates"]["publishedDate"] for sub in reviews])
    source = url
    df = pd.DataFrame(
        {
            "id": ids,
            "filtered": filtered,
            "pending": pending,
            "rating": rating,
            "title": title,
            "likes": likes,
            "experienced": experienced,
            "published": published,
            "source": source,
        }
    )
    all_dfs.append(df)

final_df = pd.concat(all_dfs)
print(final_df)

Prints:

                          id  filtered  pending  rating                                                                            title  likes               experienced                 published                                                  source
0   660c4b524ff85128f3cd5665     False    False       5                                                                Amazing Insurance      0  2024-04-01T00:00:00.000Z  2024-04-02T20:15:47.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
1   660b08acec6384757dfabdf9     False    False       5                                                  Enrollment was quick and easy!       0  2024-03-21T00:00:00.000Z  2024-04-01T21:19:09.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
2   66098c1b0353405fb0313ae2     False    False       5                                            Extremely easy to understand website…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:15:23.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
3   660b1e164e75ffb01ee011f1     False    False       2                                                                   Too expensive       0  2024-04-01T00:00:00.000Z  2024-04-01T22:50:31.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
4   66099003e9b2fe025035baef     False    False       5                                         The coverage seems really comprehensive…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:32:04.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
5   660b0af515413b0620a7d617     False    False       4                                            Everything was explained to us in an…      0  2024-03-29T00:00:00.000Z  2024-04-01T21:28:54.000Z  https://www.trustpilot.com/review/trupanion.com?page=1

...
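One detail worth noting: by default `pd.concat` keeps each frame's original 0-based index, so in the output above the row labels restart at 0 for every page. If you want one continuous index across all pages, pass `ignore_index=True`. A minimal sketch with made-up frames:

```python
import pandas as pd

# Stand-ins for the per-page data frames collected in all_dfs
page1 = pd.DataFrame({"rating": [5, 4]})
page2 = pd.DataFrame({"rating": [3, 5]})

# Default: each page keeps its own labels
stacked = pd.concat([page1, page2])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True renumbers rows 0..n-1 across all pages
renumbered = pd.concat([page1, page2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```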
