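I am working on a task with BeautifulSoup, a great Python library for web scraping. Goal: I want to get data from this page: https://schulfinder.kultus-bw.de. Note: this is a public page for finding all schools in a given region.
A typical record would look like this: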
Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail
Well, I think that with Python I would do it like this:
First, I have to send a request to the URL and fetch the HTML content of the page:
url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content
Next, I create a BeautifulSoup object and find the HTML elements that contain the school names:
soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
Extract the school names from the HTML elements and store them in a list:
school_names = [school.text.strip() for school in schools]
Then I print the list of school names:
print(school_names)
So the complete code looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]
print(school_names)
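A slightly more defensive variant of the code above, as a sketch only. The helper names fetch_html and extract_link_texts and the SAMPLE_HTML snippet are made up for illustration; the sample markup is an assumption and not the real page structure. If the site fills in its results with JavaScript, the static HTML may not contain any matching elements, and the list will simply be empty.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and fail loudly on HTTP errors (4xx/5xx)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_link_texts(html, css_class="dropdown-item"):
    """Return the stripped text of every <a> tag carrying the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_=css_class)]


# Hypothetical markup, used here so the parsing step can be tried offline.
SAMPLE_HTML = """
<div>
  <a class="dropdown-item" href="#">  Schule A </a>
  <a class="dropdown-item" href="#">Schule B</a>
  <a class="other" href="#">Impressum</a>
</div>
"""

sample_names = extract_link_texts(SAMPLE_HTML)
print(sample_names)

# Against the live page one would call:
# names = extract_link_texts(fetch_html("https://schulfinder.kultus-bw.de"))
```

Separating the download from the parsing makes it easier to see which of the two steps returns nothing.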
But I need the full records:
Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail
Ideally the output would be in CSV format. If I were more familiar with Python, I would run this small script and work with pandas; I think pandas would make this kind of task easier.
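A minimal sketch of the CSV step with pandas, assuming the records have already been collected as a list of dictionaries. The column names are the fields listed above; the two rows are invented placeholder values, not data from the site.

```python
import io

import pandas as pd

# Column order taken from the field list in the question.
FIELDS = [
    "Adresse Name", "Adresse 2", "Kategorie", "Straße",
    "PLZ und Ort", "Tel 1", "Tel 2", "Mail",
]

# Placeholder records (invented values, for illustration only).
records = [
    {"Adresse Name": "Beispielschule", "Kategorie": "Grundschule",
     "Straße": "Musterweg 1", "PLZ und Ort": "00000 Musterstadt",
     "Tel 1": "000 000000", "Mail": "info@example.org"},
    {"Adresse Name": "Musterschule", "Kategorie": "Gymnasium",
     "Straße": "Beispielallee 2", "PLZ und Ort": "00000 Musterstadt",
     "Tel 1": "000 111111", "Tel 2": "000 222222",
     "Mail": "kontakt@example.org"},
]

# Passing columns= fixes the column order; missing keys become empty cells.
df = pd.DataFrame(records, columns=FIELDS)

# Write to an in-memory buffer here; use df.to_csv("schools.csv", ...) for a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, passing encoding="utf-8" keeps characters such as "ß" intact.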
Update: here are some screenshots of the page:
Update 2: I tried to run this in Google Colab and got the following errors. Question: do I need to install some of these packages in Colab?
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product
Do I need to take care of any prerequisites in Google Colab?
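One way to check this directly is to ask Python whether each module can be found, as in the sketch below. The helper name missing_modules is made up for illustration. multiprocessing, string and itertools are part of the standard library; pandas and tqdm are third-party packages that, as far as I know, come preinstalled in Colab.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Standard-library modules from the import list above.
stdlib_missing = missing_modules(["multiprocessing", "string", "itertools"])

# Third-party modules from the import list above.
third_party_missing = missing_modules(["pandas", "tqdm"])

print("missing stdlib modules:", stdlib_missing)
print("missing third-party modules:", third_party_missing)
```

Anything reported as missing can be installed from a notebook cell with `!pip install <name>`. A package that is really missing raises ModuleNotFoundError on the import line itself, which is a different error from a KeyError.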
Here is the error log I got:
100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'branches'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'branches'
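A note on the traceback: KeyError: 'branches' is raised by pandas when a column called 'branches' is looked up and does not exist. The "0it" progress line suggests that the second loop had nothing to iterate over, so one possible cause is that the DataFrame was empty. That is an assumption, since the code that builds the frame is not shown. The sketch below reproduces the error with an empty frame and shows a guard; the helper name get_column_or_empty is made up for illustration.

```python
import pandas as pd

# Assumption: the scraping step returned no records at all.
results = []
df = pd.DataFrame(results)  # an empty frame has no columns

# Looking up a column that does not exist raises KeyError.
try:
    df["branches"]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)


def get_column_or_empty(frame, name):
    """Return frame[name] if the column exists, else an empty Series."""
    if name in frame.columns:
        return frame[name]
    return pd.Series(dtype="object")


branches = get_column_or_empty(df, "branches")
print(len(branches))
```

Printing df.shape and df.columns right before the failing line would show whether the frame is in fact empty or whether the column just has a different name.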