I have a list of Spanish insurance companies, grouped under 24 headings, on a website:
The UNESPA directory. Full list: https://www.unespa.es/en/directory
It is split across letter pages: https://www.unespa.es/en/directory/#A through https://www.unespa.es/en/directory/#Z
The idea and goal: I want to fetch the data from these pages with BS4 and requests, and ultimately save it into a DataFrame. Scraping the list with BeautifulSoup (BS4) and requests in Python seemed like a reasonable task; I figured I need the following steps:
a. First, import the necessary libraries: BeautifulSoup, requests and pandas.
b. Then use the requests library to fetch the HTML content of each page of interest, i.e. the A to Z pages.
c. Then parse the HTML content with BeautifulSoup.
d. Then extract the relevant information (the insurers' names) from the parsed HTML.
e. Finally, store the extracted data in a pandas DataFrame.
But this doesn't work, and neither does the iteration from A to Z:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)
...it fails with the following output:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E
and so on for every letter.
So I thought it would be easier to first break the complicated task into smaller steps.
I figured it's best to take a single URL that I want to visit, and to test what my request actually returns. Once that works, I can judge the response and then use the BeautifulSoup library to inspect the specific common fields. That way I avoid doing three things in one step (which can obviously go badly wrong).
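Since my error message only tells me the status code wasn't 200, my first isolated test just prints what actually comes back. A minimal sketch of that check, assuming the server might be rejecting the default python-requests User-Agent (the header value below is just an illustrative browser string, not something taken from the site):

```python
import requests

# Hypothetical browser-like User-Agent; some servers reject the default
# "python-requests/x.y" agent, so this is a cheap thing to rule out first.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

def check_url(url):
    """Fetch one URL and report exactly what came back, before any parsing."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    print(url, "->", response.status_code, len(response.content), "bytes")
    return response
```

If the status code printed here is something like 403, the problem is the request itself, not the parsing.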
So I do this for the first character, A:
import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)
But look at the output here:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
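One thing I think is relevant, if I understand URL fragments correctly: the `#A` part of the URL is a fragment, which per RFC 3986 is resolved on the client side and never sent to the server, so every letter URL should hit the same page. This can be seen without any network access:

```python
from urllib.parse import urlparse

# The "#A" suffix is a URL fragment: it is handled by the browser and is
# not part of the HTTP request sent to the server.
parts = urlparse("https://www.unespa.es/en/directory/#A")
print(parts.path)      # -> /en/directory/
print(parts.fragment)  # -> A
```

So even if the request succeeded, iterating A to Z would presumably just fetch the same HTML 26 times.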