I have a list of Spanish insurance companies, grouped under 24 headings, on a website:
The UNESPA directory. Full list: https://www.unespa.es/en/directory
It is split across letter pages: https://www.unespa.es/en/directory/#A through https://www.unespa.es/en/directory/#Z
The idea and goal: I want to fetch the data from these pages with BS4 and requests, and ultimately save it into a DataFrame. Scraping the list with BeautifulSoup (BS4) and requests in Python seemed like a reasonable task; I figured I need the following steps:
a. First, import the necessary libraries: BeautifulSoup, requests and pandas.
b. Then use the requests library to fetch the HTML content of each page of interest, i.e. the A to Z pages.
c. Then parse the HTML content with BeautifulSoup.
d. Then extract the relevant information (the insurers' names) from the parsed HTML.
e. Finally, store the extracted data in a pandas DataFrame.
But this doesn't work, and neither does the iteration from A to Z:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)
...it fails with the following output:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E
and so on for every letter.
So I thought it would be easier to first break the complicated task into smaller steps.
I figured it's best to take a single URL that I want to visit, and to test what my request actually returns. Once that works, I can judge the response and then use the BeautifulSoup library to inspect the specific common fields. That way I avoid doing three things in one step (which can obviously go badly wrong).
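Since my error message only tells me the status code wasn't 200, my first isolated test just prints what actually comes back. A minimal sketch of that check, assuming the server might be rejecting the default python-requests User-Agent (the header value below is just an illustrative browser string, not something taken from the site):

```python
import requests

# Hypothetical browser-like User-Agent; some servers reject the default
# "python-requests/x.y" agent, so this is a cheap thing to rule out first.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

def check_url(url):
    """Fetch one URL and report exactly what came back, before any parsing."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    print(url, "->", response.status_code, len(response.content), "bytes")
    return response
```

If the status code printed here is something like 403, the problem is the request itself, not the parsing.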
So I do this for the first character, A:
import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)
But look at the output here:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
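One thing I think is relevant, if I understand URL fragments correctly: the `#A` part of the URL is a fragment, which per RFC 3986 is resolved on the client side and never sent to the server, so every letter URL should hit the same page. This can be seen without any network access:

```python
from urllib.parse import urlparse

# The "#A" suffix is a URL fragment: it is handled by the browser and is
# not part of the HTTP request sent to the server.
parts = urlparse("https://www.unespa.es/en/directory/#A")
print(parts.path)      # -> /en/directory/
print(parts.fragment)  # -> A
```

So even if the request succeeded, iterating A to Z would presumably just fetch the same HTML 26 times.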