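This is my first time scraping a website. The problem is that two different tables on the page have the same class name. So far I have learned that to find data I have to locate it by the class name of its HTML tag.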


import bs4 as bs
from urllib.request import Request, urlopen
import pandas as pd


req = Request('https://www.worldometers.info/world-population/albania-population/',
              headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

soup = bs.BeautifulSoup(webpage, 'html5lib')


# Albania population
population = soup.find(class_='col-md-8 country-pop-description')
# Take the text of the second <strong> tag in this block (index 1, as in the original lookup)
current_population = population.find_all('strong')[1].text.strip()
# print(current_population)

# getting all city populations
city_population = soup.find(
    class_='table table-hover table-condensed table-list')
# print(city_population.text, end=" ")


# the first table
# population of Albania (historical)
column_names = ['Year', 'Population', 'Yearly Change %', 'Yearly Change', 'Migrants (net)',
                'Median Age', 'Fertility Rate', 'Density(P/Km2)', 'Urban Pop %',
                'Urban Population', 'Countrys Share of Population', 'World Population',
                'Albania Global Rank']

historic_population = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')

# Collect one dict per table row and build the DataFrame once at the end;
# DataFrame.append was removed in pandas 2.0.
rows = []
for row in historic_population.tbody.find_all('tr'):
    columns = row.find_all('td')

    if columns:
        # Pair each cell with its column name, in page order
        rows.append({name: cell.text.strip()
                     for name, cell in zip(column_names, columns)})

df = pd.DataFrame(rows, columns=column_names)
df.head()
# print(df)

# the second table
# Albania Population Forecast

# Same class string as above, so find() returns the first (historical) table again
forecast_population = soup.find(
    'table', class_='table table-striped table-bordered table-hover table-condensed table-list')

for row in forecast_population.tbody.find_all('tr'):
    columns = row.find_all('td')
    print(columns)

Recommended answer

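As already noted, use .find_all(). When you use .find(), it returns only the first match it finds. .find_all() returns every match as a list, in document order. You then pick the table you want by its index in that list.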
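As a minimal sketch of that difference, the snippet below parses a tiny hypothetical page instead of the real site. The HTML string, the shortened class name "table table-list", and the helper first_row are assumptions made up for illustration; only find_all() and the indexing are the point.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables that share the same class string.
HTML = """
<table class="table table-list"><tbody>
  <tr><td>2020</td><td>2877797</td></tr>
</tbody></table>
<table class="table table-list"><tbody>
  <tr><td>2025</td><td>2840464</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns every match in document order; find() would stop at the first.
tables = soup.find_all("table", class_="table table-list")
historic_table, forecast_table = tables[0], tables[1]


def first_row(table):
    """Return the stripped cell texts of the first <tr> in a table."""
    return [td.get_text(strip=True) for td in table.find("tr").find_all("td")]


print(len(tables))
print(first_row(historic_table))
print(first_row(forecast_table))
```

On the real page the same pattern would apply with the full class string from the question, assuming the two tables still appear in that order.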

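On a side note, why not let pandas parse these tables? pandas.read_html does the HTML parsing for you, using lxml or BeautifulSoup (with html5lib) under the hood.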

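from io import StringIO

import requests
import pandas as pd

url = 'https://www.worldometers.info/world-population/albania-population/'
response = requests.get(url)

# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string.
# read_html returns one DataFrame per matching table, in page order.
dfs = pd.read_html(StringIO(response.text), attrs={'class':'table table-striped table-bordered table-hover table-condensed table-list'})

historic_population = dfs[0]
forecast_population = dfs[1]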

Output:

print(historic_population)
    Year  Population  ... World Population  AlbaniaGlobal Rank
0   2020     2877797  ...       7794798739                 140
1   2019     2880917  ...       7713468100                 140
2   2018     2882740  ...       7631091040                 140
3   2017     2884169  ...       7547858925                 140
4   2016     2886438  ...       7464022049                 141
5   2015     2890513  ...       7379797139                 141
6   2010     2948023  ...       6956823603                 138
7   2005     3086810  ...       6541907027                 134
8   2000     3129243  ...       6143493823                 131
9   1995     3112936  ...       5744212979                 130
10  1990     3286073  ...       5327231061                 125
11  1985     2969672  ...       4870921740                 125
12  1980     2682690  ...       4458003514                 125
13  1975     2411732  ...       4079480606                 126
14  1970     2150707  ...       3700437046                 125
15  1965     1896171  ...       3339583597                 127
16  1960     1636090  ...       3034949748                 124
17  1955     1419994  ...       2773019936                 127

[18 rows x 13 columns]



print(forecast_population)
     Year  Population  ... World Population  AlbaniaGlobal Rank
0     NaN         NaN  ...              NaN                 NaN
1  2020.0   2877797.0  ...     7.794799e+09               140.0
2  2025.0   2840464.0  ...     8.184437e+09               141.0
3  2030.0   2786974.0  ...     8.548487e+09               143.0
4  2035.0   2721082.0  ...     8.887524e+09               145.0
5  2040.0   2634384.0  ...     9.198847e+09               146.0
6  2045.0   2533645.0  ...     9.481803e+09               147.0
7  2050.0   2424061.0  ...     9.735034e+09               148.0

[8 rows x 13 columns]
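The forecast table comes back with an all-NaN first row, which also forces the numeric columns to float. One possible cleanup is sketched below. The small DataFrame is a hypothetical stand-in for dfs[1] with only two of its columns, and the name cleaned is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfs[1]: two columns, with the all-NaN first row seen above.
forecast_population = pd.DataFrame({
    "Year": [np.nan, 2020.0, 2025.0],
    "Population": [np.nan, 2877797.0, 2840464.0],
})

# Drop rows where every cell is NaN, restore integer dtypes, and renumber the index.
cleaned = (
    forecast_population
    .dropna(how="all")
    .astype({"Year": "int64", "Population": "int64"})
    .reset_index(drop=True)
)

print(cleaned)
```

Using how="all" keeps rows that are only partly empty, so a single missing cell elsewhere in the real table would not cause a row to be dropped (though it would make the integer cast fail for that column).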
