I'm trying to loop through multiple pages of a website (two pages in this example), scrape the relevant customer review data, and ultimately combine it into a single data frame.
The challenge I'm running into is that my code appears to produce two separate data frames inside a single data frame object (`df` in the attached code). I may be mistaken, but that's how I'm interpreting it.
Here is a screenshot of what I described above:
Separate data frames within single data frame object
Here is the code that produced the result in the screenshot:
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([sub['id'] for sub in reviews])
    filtered = pd.Series([sub['filtered'] for sub in reviews])
    pending = pd.Series([sub['pending'] for sub in reviews])
    rating = pd.Series([sub['rating'] for sub in reviews])
    title = pd.Series([sub['title'] for sub in reviews])
    likes = pd.Series([sub['likes'] for sub in reviews])
    experienced = pd.Series([sub['dates']['experiencedDate'] for sub in reviews])
    published = pd.Series([sub['dates']['publishedDate'] for sub in reviews])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                       'title': title, 'likes': likes, 'experienced': experienced,
                       'published': published, 'source': source})
    print(df)
I've been relying on these posts as potential solutions, without any luck:
Rbind, having data frames within data frames causes errors?
Analyse data frames inside a list of data frames and store all results in single data frame
Merge multiple data frames into a single data frame in python
Specifically, I keep getting the following error:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I'm sure the `<class 'str'>` bit is a clue to what the problem is, but I've been spinning my wheels and feel I need to put my pencil down and ask for help. I'm relatively new to Python, and my gut tells me I need to fix something upstream in the code to avoid this problem in the first place. In other words, while there may be a way to merge these two data frames into one, I feel the root of the problem is occurring earlier and needs to be addressed there.
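For what it's worth, here is a minimal sketch of the pattern I believe applies: instead of overwriting `df` on each pass of the loop, collect one DataFrame per page in a list and combine them afterwards with `pd.concat`. The `pages` data below is made-up sample data standing in for the parsed JSON reviews, so the example runs without any network requests:

```python
import pandas as pd

# Hypothetical sample data standing in for the parsed JSON
# reviews from two scraped pages.
pages = [
    [{'id': 'a1', 'rating': 5, 'title': 'Great'},
     {'id': 'a2', 'rating': 4, 'title': 'Good'}],
    [{'id': 'b1', 'rating': 2, 'title': 'Poor'}],
]

frames = []  # collect one DataFrame per page instead of overwriting df
for page_num, reviews in enumerate(pages, start=1):
    page_df = pd.DataFrame(reviews)
    page_df['source'] = f"page-{page_num}"  # record where each row came from
    frames.append(page_df)

# Concatenate the list of DataFrames into one;
# ignore_index renumbers the rows 0..n-1.
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (3, 4)
```

Note that `pd.concat` takes a list of Series/DataFrame objects; passing a plain string (like the `source` URL) into it is what raises the `TypeError` above.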