I am looking for a public list of volunteering services in Europe: I don't need full addresses - just names and websites. I am thinking of data in XML, CSV... with these fields: name, country, and a few extra fields, with one record for each country. btw: the European Voluntary Service is a great thing for young people.
I found a very good page which is very, very comprehensive - see below.
I want to gather data on the european volunteering services hosted on the European site:
https://youth.europa.eu/go-abroad/volunteering/opportunities_en
There are hundreds of volunteering opportunities there - each one stored on a page like these:
https://youth.europa.eu/solidarity/placement/39020_en
https://youth.europa.eu/solidarity/placement/38993_en
https://youth.europa.eu/solidarity/placement/38973_en
https://youth.europa.eu/solidarity/placement/38972_en
https://youth.europa.eu/solidarity/placement/38850_en
https://youth.europa.eu/solidarity/placement/38633_en
idea: I think it would be great to gather the data - i.e. with a scraper based on BS4 and requests - parse it, and afterwards print the data in a dataframe.
Hmm - I think we could loop over all the URLs:
placement/39020_en
placement/38993_en
placement/38973_en
placement/38850_en
I think we could iterate over the IDs from 0 up to 100000 to fetch all the results that are stored under placement. But this idea is not backed by any code yet. In other words, at the moment I do not know how to iterate over such a large range.
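A rough sketch of that iteration (assuming that non-existent IDs answer with a non-200 status - I have not verified this against the live server) could look like:

```python
import requests

BASE = "https://youth.europa.eu/solidarity/placement/{}_en"


def placement_urls(start, stop):
    """Generate candidate placement URLs for a range of numeric IDs."""
    return [BASE.format(i) for i in range(start, stop)]


def fetch_existing(urls, session=None):
    """Yield (url, response) pairs only for URLs that actually exist.

    Assumes the site answers non-existent IDs with a non-200 status;
    this has not been checked against the real server.
    """
    session = session or requests.Session()
    for url in urls:
        resp = session.get(url, timeout=10)
        if resp.status_code == 200:
            yield url, resp


# Probe a small ID window first instead of 0..100000 in one go:
# for url, resp in fetch_existing(placement_urls(39000, 39030)):
#     print(url)
```

That way the huge range can be split into small windows and the non-existing IDs are simply skipped.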
At the moment I think this is a basic approach to start with:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# List of URLs to scrape
urls = [
    "https://youth.europa.eu/solidarity/placement/39020_en",
    "https://youth.europa.eu/solidarity/placement/38993_en",
    "https://youth.europa.eu/solidarity/placement/38973_en",
    "https://youth.europa.eu/solidarity/placement/38972_en",
    "https://youth.europa.eu/solidarity/placement/38850_en",
    "https://youth.europa.eu/solidarity/placement/38633_en"
]

# Function to scrape data from a single URL
def scrape_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extracting relevant data
    title = soup.find("h2").text.strip()
    location = soup.find("span", class_="field--name-field-placement-location").text.strip()
    start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
    end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()

    # Returning data as dictionary
    return {
        "Title": title,
        "Location": location,
        "Start Date": start_date,
        "End Date": end_date,
        "URL": url
    }

# Scrape data from all URLs
data = []
for url in urls:
    data.append(scrape_data(url))

# Convert data to DataFrame
df = pd.DataFrame(data)

# Print DataFrame
print(df)
This gives me the following traceback:
AttributeError Traceback (most recent call last)
<ipython-input-1-e65c612df65e> in <cell line: 37>()
36 data = []
37 for url in urls:
---> 38 data.append(scrape_data(url))
39
40 # Convert data to DataFrame
<ipython-input-1-e65c612df65e> in scrape_data(url)
20 # Extracting relevant data
21 title = soup.find("h2").text.strip()
---> 22 location = soup.find("span", class_="field--name-field-placement-location").text.strip()
23 start_date = soup.find("span", class_="field--name-field-placement-start-date").text.strip()
24 end_date = soup.find("span", class_="field--name-field-placement-end-date").text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
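I guess the error means that soup.find(...) returns None when a class is not found on the page - so the class names above are probably wrong for the real markup. A small helper like this (just a sketch; the real field classes still need to be checked in the page source) would at least avoid the hard crash while debugging:

```python
from bs4 import BeautifulSoup


def safe_text(soup, name, css_class=None, default="n/a"):
    """Return the stripped text of the first matching tag, or a default
    value instead of raising AttributeError when the tag is missing."""
    tag = soup.find(name, class_=css_class) if css_class else soup.find(name)
    return tag.get_text(strip=True) if tag else default


# Tiny demo with hand-made HTML (not the real placement markup):
html = "<div><h2> Example placement </h2></div>"
soup = BeautifulSoup(html, "html.parser")
print(safe_text(soup, "h2"))                     # found -> text
print(safe_text(soup, "span", "missing-class"))  # missing -> "n/a"
```

With that in place every missing field shows up as "n/a" in the dataframe instead of killing the whole loop.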