I'm currently working on a tool that collects German insurance data. There is a complete list of insurer data here, covering companies from A to Z.
Our members:
https://www.gdv.de/gdv/der-gdv/unsere-mitglieder gives an overview of roughly 478 results:
For letter A: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
For letter B: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B
and so forth. By the way, see for example the page of one company:
https://www.gdv.de/gdv/der-gdv/unsere-mitglieder/ba-die-bayerische-allgemeine-versicherung-ag-47236
From these pages we need the contact details and the address.
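To see where the contact details and address actually sit in the markup of such a company page, a quick inspection sketch like the following can help. It just fetches the sample page above and prints the tag and class of anything that mentions typical contact keywords; the German keywords ('Telefon', 'E-Mail', 'Adresse', 'Straße') are only my assumption about what appears near the data:

import requests
from bs4 import BeautifulSoup

# Inspect one company detail page to find out which tags/classes actually
# hold the contact details and address.
sample_url = 'https://www.gdv.de/gdv/der-gdv/unsere-mitglieder/ba-die-bayerische-allgemeine-versicherung-ag-47236'
response = requests.get(sample_url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Print the tag name and class of every element whose text contains a
# typical contact/address keyword (the keywords themselves are guesses).
for keyword in ('Telefon', 'E-Mail', 'Adresse', 'Straße'):
    for text_node in soup.find_all(string=lambda s: s and keyword in s):
        parent = text_node.parent
        print(keyword, '->', parent.name, parent.get('class'))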
Well, I think this task is best handled with a small BS4 scraper using Requests, putting all the data into a DataFrame. I use BeautifulSoup to parse the HTML and Requests to make the HTTP requests. The best approach, I think, is to use BeautifulSoup and Requests to extract the contact data and addresses from the given URLs (see below and above).
First, we define a function scrape_insurance_company that takes a URL as input, sends an HTTP GET request to it, and uses BeautifulSoup to extract the contact data and address.
Finally, we return a dictionary with the extracted data. Since we need to cover the letters A to Z, we have to iterate: we loop over the list of letter URLs, call this function for each URL to collect the data, and afterwards organize everything into a Pandas DataFrame.
Note: I'm running this on Google Colab:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd


def scrape_insurance_company(url):
    # Send a GET request to the letter overview page
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        # Find all the links to insurance companies
        # (NOTE: 'entry-title' is my guess at the class name used on the page)
        company_links = soup.find_all('a', class_='entry-title')
        # List to store the data for all insurance companies on this page
        all_data = []
        # Iterate through each company link
        for link in company_links:
            # hrefs may be relative, so join them with the page URL
            company_url = urljoin(url, link['href'])
            company_data = scrape_company_data(company_url)
            if company_data:
                all_data.append(company_data)
        return all_data
    else:
        print("Failed to fetch the page:", response.status_code)
        return None


def scrape_company_data(url):
    # Send a GET request to the company detail page
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        # DEBUG: Print HTML content of the page
        print(soup.prettify())
        # Find the relevant elements containing contact data and address
        # (NOTE: these class names are also guesses)
        contact_info = soup.find('div', class_='contact')
        address_info = soup.find('div', class_='address')
        # Extract contact data and address if found
        contact_data = contact_info.text.strip() if contact_info else None
        address = address_info.text.strip() if address_info else None
        return {'Contact Data': contact_data, 'Address': address}
    else:
        print("Failed to fetch the page:", response.status_code)
        return None


# List to store data for all insurance companies
all_insurance_data = []

# Iterate through the alphabet
for letter in range(ord('A'), ord('Z') + 1):
    letter_url = f"https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter={chr(letter)}"
    print("Scraping page:", letter_url)
    data = scrape_insurance_company(letter_url)
    if data:
        all_insurance_data.extend(data)

# Subsequently, convert the data to a Pandas DataFrame
df = pd.DataFrame(all_insurance_data)

# Finally, save the data to a CSV file
df.to_csv('insurance_data.csv', index=False)
print("Scraping completed and data saved to 'insurance_data.csv'.")
Well, this is what everything currently looks like in the Google Colab console:
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=C
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=D
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=Z
Scraping completed and data saved to 'insurance_data.csv'.
But the resulting list is still empty. I'm struggling a bit here.
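My current suspicion is that the class names I guessed ('entry-title', 'contact', 'address') simply don't exist in the real markup, so find_all returns nothing and no company page is ever fetched. A small diagnostic I plan to run next looks like this; the '/unsere-mitglieder/' substring filter is only an assumption based on the example URL above:

import requests
from bs4 import BeautifulSoup

# Diagnostic: does the letter page contain anything matching my selector,
# and what do the company links on it really look like?
letter_url = 'https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A'
soup = BeautifulSoup(requests.get(letter_url).content, 'html.parser')

print("anchors with class 'entry-title':", len(soup.find_all('a', class_='entry-title')))

# Fallback: list every anchor whose href looks like a member detail page
# ('/unsere-mitglieder/' is assumed from the example company URL above).
for a in soup.find_all('a', href=True):
    if '/unsere-mitglieder/' in a['href']:
        print(a['href'], '->', a.get_text(strip=True))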