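I'm working on a task with BeautifulSoup, an awesome Python library for scraping all kinds of things. Goal: I want to get the data from this page: https://schulfinder.kultus-bw.de. Note: this is a public page for finding all schools in a given region.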

So a typical dataset would look like this:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

Well, I think that with Python I would do it like this:

First, I have to send a request to the URL and get the HTML content of the page:

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

After that, as a next step, I would create a BeautifulSoup object and find the HTML elements that contain the school names:

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})

Extract the school names from the HTML elements and store them in a list:

school_names = [school.text.strip() for school in schools]

Then I need to print the list of school names:

print(school_names)

OK, the full code looks like this:

import requests
from bs4 import BeautifulSoup

url = 'https://schulfinder.kultus-bw.de'
response = requests.get(url)
html_content = response.content

soup = BeautifulSoup(html_content, 'html.parser')
schools = soup.find_all('a', {'class': 'dropdown-item'})
school_names = [school.text.strip() for school in schools]

print(school_names)

But I need the complete dataset:

Adresse Name
Adresse 2
Kategorie
Straße
PLZ und Ort
Tel 1
Tel 2
Mail 

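The best option would be to output it in CSV format. If I were more familiar with Python, I would run this small script and work with Pandas; I think Pandas would make this kind of thing easier to handle.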
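For the CSV output mentioned above, a minimal sketch using only the standard library's csv module might look like this. The column names follow the dataset listed in the question; the `write_schools_csv` helper and the sample record are hypothetical placeholders, not data taken from the site.

```python
import csv
import io

# Column names taken from the dataset described in the question.
FIELDS = ["Adresse Name", "Adresse 2", "Kategorie", "Straße",
          "PLZ und Ort", "Tel 1", "Tel 2", "Mail"]

def write_schools_csv(records, fileobj):
    """Write a list of dicts to fileobj as CSV; missing keys become empty cells."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Hypothetical sample record, for illustration only.
sample = [{"Adresse Name": "Beispielschule", "Kategorie": "Grundschule",
           "Straße": "Musterstr. 1", "PLZ und Ort": "12345 Musterstadt"}]

buffer = io.StringIO()
write_schools_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("schools.csv", "w", newline="", encoding="utf-8")`. With Pandas, `DataFrame.to_csv` does the same job once the records are in a DataFrame.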

..

Update: screenshots of the page (images not available here).

Update 2: I tried to run this in Google Colab and got the following errors. Question: do I need to install some of the packages in Colab?

import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

Do I need to take care of any prerequisites in Google Colab?

Please see the error log that I got:

100%|██████████| 676/676 [00:00<00:00, 381711.03it/s]
0it [00:00, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

5 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'branches'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'branches'

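Recommended answer

You can try this: when you type "aa" and click "suchen", the server returns all items that contain "aa". So you can try all combinations (aa, ab, ac, ...) to collect all school IDs, and then fetch the information for every school: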
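As a small illustration of the enumeration step, the search terms can be generated with `itertools.product`. This is only a sketch; the `search_terms` helper is a hypothetical name introduced here. Because many schools match more than one term, the IDs collected from the responses should be de-duplicated, for example with a set.

```python
from itertools import product
from string import ascii_lowercase

def search_terms(length=2):
    """Return every lowercase a-z string of the given length, in alphabetical order."""
    return ["".join(chars) for chars in product(ascii_lowercase, repeat=length)]

terms = search_terms(2)
print(len(terms))   # 676 two-letter terms (26 * 26)
print(terms[:3])    # ['aa', 'ab', 'ac']
```

The count of 676 matches the total shown by the first progress bar in the logs.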

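import requests
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool
from string import ascii_lowercase as chars
from itertools import product

api_url1 = 'https://schulfinder.kultus-bw.de/api/schools?distance=&outposts=1&owner=&school_kind=&term={term}&types=&work_schedule='
api_url2 = 'https://schulfinder.kultus-bw.de/api/school?uuid={uuid}'

def get_school(term):
    # return the list of schools matching the search term ([] if the request fails)
    try:
        return requests.get(api_url1.format(term=term)).json()
    except (requests.RequestException, ValueError):
        # network error or invalid JSON; other errors (e.g. a missing import) still surface
        return []

def get_school_detail(uuid):
    # return the detail record for a single school
    return requests.get(api_url2.format(uuid=uuid)).json()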

if __name__ == '__main__':
    l = [''.join(t) for t in product(chars, chars)]
    # you can also try all 3-character combinations (this yields 4476 results, but the first step takes longer)
    # l = [''.join(t) for t in product(chars, chars, chars)]

    all_data = []
    all_uuids = set()

    with Pool(processes=8) as pool:
        for result in tqdm(pool.imap_unordered(get_school, l), total=len(l)):
            for item in result:
                all_uuids.add(item['uuid'])

    with Pool(processes=16) as pool:
        for r in tqdm(pool.imap_unordered(get_school_detail, all_uuids), total=len(all_uuids)):
            all_data.append(r)

    df = pd.DataFrame(all_data)

    df = df.explode('branches')
    df = df.explode('trades')
    df = pd.concat([df, df.pop('branches').apply(pd.Series).add_prefix('branch_')], axis=1)
    df = pd.concat([df, df.pop('trades').apply(pd.Series).add_prefix('trade_')], axis=1)

    print(df.head())

    df.to_csv('data.csv', index=False)

This fetches the information for all 4461 schools and saves the data to data.csv:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:38<00:00, 17.63it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4461/4461 [00:22<00:00, 194.86it/s]
  outpost_number                                                 name               street house_number postcode        city            phone              fax                              email                                   website tablet_tranche tablet_platform tablet_branches tablet_trades       lat      lng  official  branch_branch_id branch_acronym branch_description_long  trade_0 trade_trade_id trade_description
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             15110             RS              Realschule      NaN            NaN               NaN
0              0  Schule am Schlosspark Realschule und Werkrealschule  Schussenrieder Str.           25    88326   Aulendorf   +4975259238102   +4975259238104  poststelle@04160556.schule.bwl.de         http://www.schuleamschlosspark.de           None            None            None          None  47.95760  9.63881         0             14210            WRS          Werkrealschule      NaN            NaN               NaN
1              0              Schauenburg-Schule Grundschule Urloffen      Schauenburgstr.            4    77767  Appenweier     +49780597236    +497805914396  poststelle@04155676.schule.bwl.de  http://www.schauenburgschule-urloffen.de           None            None            None          None  48.56460  7.97361         0             12110             GS             Grundschule      NaN            NaN               NaN
2              0                      Klosterwiesenschule Grundschule            Boschstr.            1    88255      Baindt  +49750294114132  +49750294114139  poststelle@04139725.schule.bwl.de               http://www.baindt.de/schule           None            None            None          None  47.84319  9.65829         0             12110             GS             Grundschule      NaN            NaN               NaN
3              0                       Montessori-Grundschule Nußdorf          Zum Laugele            7    88662  Überlingen     +49755165620             None  poststelle@04117742.schule.bwl.de        http://www.grundschule-nussdorf.de           None            None            None          None  47.75325  9.19516         0             12110             GS             Grundschule      NaN            NaN               NaN

...
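The repeated row for "Schule am Schlosspark" in the output comes from the `explode('branches')` step: each nested branch becomes its own row, with the branch fields prefixed by `branch_`. The following pure-Python sketch shows the same idea without Pandas. The `explode_nested` helper is hypothetical, and the sample records are trimmed-down stand-ins rather than real API responses.

```python
def explode_nested(records, key, prefix):
    """Yield one flat dict per item in record[key], prefixing the item's fields."""
    for record in records:
        base = {k: v for k, v in record.items() if k != key}
        items = record.get(key) or [None]  # keep records that have no nested items
        for item in items:
            row = dict(base)
            if item:
                row.update({prefix + k: v for k, v in item.items()})
            yield row

# Trimmed-down sample, for illustration only.
sample = [
    {"name": "Schule am Schlosspark",
     "branches": [{"acronym": "RS"}, {"acronym": "WRS"}]},
    {"name": "Klosterwiesenschule", "branches": []},
]

rows = list(explode_nested(sample, "branches", "branch_"))
for row in rows:
    print(row)
```

A school with two branches yields two rows, and a school with no branches is kept as a single row without any `branch_` fields.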

LibreOffice screenshot (image not available here).
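About the KeyError: 'branches' in Colab: the log in the question shows the first progress bar finishing all 676 terms almost instantly and the second one processing 0 items, which suggests that `get_school` returned an empty list every time. One likely cause, assuming the script was run with only the imports listed in the question, is that `requests` was never imported: a bare `except` turns the resulting NameError into `[]`, the DataFrame ends up empty, and there is no 'branches' column to explode. The sketch below reproduces that effect without any network access; `missing_http_library` and both function names are hypothetical stand-ins.

```python
def fetch_json_silent(term):
    """Mimics a bare except: any error, even a NameError, becomes []."""
    try:
        return missing_http_library.get(term)  # name is intentionally undefined
    except:
        return []

def fetch_json_strict(term):
    """Swallow only expected errors; programming errors still surface."""
    try:
        return missing_http_library.get(term)  # name is intentionally undefined
    except (OSError, ValueError):  # e.g. connection problems, invalid JSON
        return []

print(fetch_json_silent("aa"))  # [] (the real problem is hidden)

try:
    fetch_json_strict("aa")
except NameError as err:
    print("surfaced:", err)
```

The traceback and the progress bars already show that pandas and tqdm are available in that Colab session, so adding `import requests` at the top is probably all that is needed; checking that the list of collected IDs is non-empty before building the DataFrame would make this kind of failure easier to spot.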
