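I am currently putting together a very simple parser that works through a member list from A to Z. The member list is here:
See: https://vvonet.vvo.at/vvonet_mitgliederverzeichnisneu
Note: for each member we have to open the "kontaktinformationen" link and copy the data found there into a pandas DataFrame.
I think I can do this in Python with requests and BeautifulSoup, and either print the result to the screen or store it in a PDF.
First, the script should fetch the member list page and extract the links to the individual member pages. Then it should visit each member's "kontaktinformationen" page and extract the contact information. Finally, I think it is best to store the contact information in a DataFrame, which I can then print to the screen or save as a CSV file.
Here is my attempt: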
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the member list page
url = "https://vvonet.vvo.at/vvonet_mitgliederverzeichnisneu"
response = requests.get(url)

# Check whether the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")
    # Find all member links
    member_links = soup.find_all("a", class_="font1")
    # Initialize the list that collects one entry per member
    member_data = []
    # Iterate over the member links
    for member_link in member_links:
        # Build the URL of the member's "kontaktinformationen" page
        member_url = "https://vvonet.vvo.at" + member_link["href"] + "/kontaktinformationen"
        # Send a GET request to the member's "kontaktinformationen" page
        member_response = requests.get(member_url)
        # Check whether the request was successful
        if member_response.status_code == 200:
            # Parse the HTML content of the page
            member_soup = BeautifulSoup(member_response.content, "html.parser")
            # Find the contact information section
            contact_info_div = member_soup.find("div", class_="contact")
            # Check whether the contact information section exists
            if contact_info_div:
                # Extract the contact information as plain text
                contact_info_text = contact_info_div.get_text(separator="\n", strip=True)
                member_data.append(contact_info_text)
            else:
                member_data.append("Contact information not found")
        else:
            member_data.append(f"Failed to retrieve contact information for {member_link.text.strip()}")
    # Create a DataFrame from the collected entries
    df = pd.DataFrame(member_data, columns=["Contact Information"])
    # Display the DataFrame
    print(df)
    # Alternatively, save the DataFrame to a CSV file
    # df.to_csv("member_contact_information.csv", index=False)
else:
    print("Failed to retrieve the member list page.")
But now I get an empty DataFrame:
Empty DataFrame
Columns: [Contact Information]
Index: []
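Since every pass through the loop appends something to member_data, a DataFrame with no rows at all means the loop never ran, so soup.find_all("a", class_="font1") must have returned an empty list. The sketch below is one way I could narrow that down: it counts the anchors in a piece of HTML and lists which class names they actually carry. The helper name summarize_links and the sample markup are my own assumptions for illustration and are not taken from the real site. If the real page shows zero anchors in total, the list is possibly filled in by JavaScript after loading, in which case requests alone would never see it.

```python
from collections import Counter
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://vvonet.vvo.at"


def summarize_links(html, css_class="font1"):
    """Report how many <a href> tags exist and how many carry css_class.

    Hypothetical helper for debugging; not part of any library.
    """
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=True)
    matched = soup.find_all("a", class_=css_class, href=True)
    # Count every class seen on anchors, to spot a better selector
    class_counts = Counter(c for a in anchors for c in a.get("class", []))
    return {
        "total_anchors": len(anchors),
        "matched_anchors": len(matched),
        "class_counts": dict(class_counts),
        # urljoin handles both relative and absolute href values
        "matched_urls": [urljoin(BASE_URL, a["href"]) for a in matched],
    }


# Hypothetical sample markup, NOT copied from the real member list page
SAMPLE_HTML = """
<ul>
  <li><a class="member" href="/member/1">Member One</a></li>
  <li><a class="member" href="/member/2">Member Two</a></li>
  <li><a href="/imprint">Imprint</a></li>
</ul>
"""

report = summarize_links(SAMPLE_HTML)
print(report)
```

To check the real page, I would pass response.text from the script above to summarize_links instead of SAMPLE_HTML and compare total_anchors with matched_anchors.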