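I'm trying to use Python and improve my scraping skills.

The following code is supposed to extract author and affiliation information from a scientific article: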

import csv

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    target = soup.find_all("span")
    
    with open("data.csv", 'w', newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Autor", "Afiliacao"])
        for tar in target:
            name = tar.find("span", attrs={'name': True})
            print(name)
            afiliacao = tar.find(class_='affiliation')
            
            writer.writerow([name, afiliacao])


main("https://rpmgf.pt/ojs/index.php/rpmgf/article/view/13494")

However, the only result I get is None.

If, on the other hand, I do something like this:

# texto_artigo: presumably the response returned by requests.get(url)
texto_soup = BeautifulSoup(texto_artigo.content, "lxml")
autores = texto_soup.find_all('span', attrs={'class': 'name'})
afiliacao = texto_soup.find_all('span', attrs={'class': 'affiliation'})
for a, b in zip(autores, afiliacao):
    print(a.text + b.text)

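it works (I still need to strip the extra whitespace and newlines from the results). But I don't understand why the first example doesn't work.

Bonus question: how do I solve the missing-data problem in the second example? With zip, authors are dropped or misaligned when an affiliation is missing. I know I could use zip_longest, but then I would get an error whenever an element is missing, because the fill value has no .text attribute.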

Recommended answer

Strain the document down to the authors section, then select on class_:

import bs4
from requests import Session


STRAINER = bs4.SoupStrainer(name='section', class_='item authors')


def fetch_authors(session: Session, article: int) -> bs4.ResultSet:
    with session.get(
        url=f'https://rpmgf.pt/ojs/index.php/rpmgf/article/view/{article}',
    ) as resp:
        resp.raise_for_status()
        dom = bs4.BeautifulSoup(markup=resp.text, parse_only=STRAINER, features='lxml')
    return dom.find_all(name='span', class_='name')


def main() -> None:
    with Session() as session:
        for author_tag in fetch_authors(session=session, article=13494):
            print(author_tag.text.strip())


if __name__ == '__main__':
    main()

Output:

Maria João Gonçalves
Clara Fonseca
Inês Pintalhão
Rodrigo Costa
Ana Calafate
Manuel Henriques
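
As for why the first attempt in the question only produced None: a minimal sketch, assuming the simplified markup below (it is hypothetical, modelled on the selectors used in this answer, and the real page may differ in detail). `attrs={'name': True}` matches tags that carry an HTML attribute called `name`, not tags whose class is `name`, and `find` only searches inside each span, while the name and affiliation spans sit next to each other rather than inside another span.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; an assumption for illustration only.
HTML = """
<section class="item authors">
  <ul>
    <li>
      <span class="name">Author One</span>
      <span class="affiliation">Clinic A</span>
    </li>
    <li>
      <span class="name">Author Two</span>
    </li>
  </ul>
</section>
"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(HTML, 'html.parser')

# The question's approach: inside every <span>, look for a nested <span>
# that has an attribute literally called "name". No such tag exists.
nested = [span.find('span', attrs={'name': True}) for span in soup.find_all('span')]
print(nested)  # [None, None, None]

# Selecting on the class attribute finds the tags directly;
# get_text(strip=True) also removes the surrounding whitespace.
names = [tag.get_text(strip=True) for tag in soup.find_all('span', class_='name')]
print(names)  # ['Author One', 'Author Two']
```

The same reasoning applies to `tar.find(class_='affiliation')` in the question: it looks inside each span, where no affiliation tag is nested.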

If you also care about affiliations, search each name's following siblings:

from typing import Iterator

import bs4
from requests import Session


STRAINER = bs4.SoupStrainer(name='section', class_='item authors')


def fetch_authors(session: Session, article: int) -> Iterator[tuple[str, str | None]]:
    with session.get(
        url=f'https://rpmgf.pt/ojs/index.php/rpmgf/article/view/{article}',
    ) as resp:
        resp.raise_for_status()
        # Name the parser explicitly so bs4 does not have to guess one (and warn about it)
        dom = bs4.BeautifulSoup(markup=resp.text, parse_only=STRAINER, features='lxml')

    for name in dom.find_all(name='span', class_='name'):
        name_str = name.text.strip()
        # Search through siblings for a matching affiliation tag
        for sibling in name.find_next_siblings(name='span'):
            # A span without a class attribute yields an empty tuple here
            classes = sibling.attrs.get('class', ())
            if 'affiliation' in classes:
                # If we've found an affiliation class on the soonest span sibling, use it
                yield name_str, sibling.text.strip()
                break
            if 'name' in classes:
                # If we've encountered the next name, there is no affiliation.
                yield name_str, None
                break
        else:
            # If no sibling span is an affiliation or a name, there is no affiliation.
            yield name_str, None


def main() -> None:
    with Session() as session:
        print('An article with some authors missing affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13545):
            print(f'{name} ({affiliation})')
        print()

        print('An article with authors all having affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13494):
            print(f'{name} ({affiliation})')
        print()


if __name__ == '__main__':
    main()

Output:

An article with some authors missing affiliation:
Andreia Oliveira (Médica)
Rita Paraíso (None)
Paola Lobão (None)
Vanessa Guerreiro (None)

An article with authors all having affiliation:
Maria João Gonçalves (USF Garcia de Orta)
Clara Fonseca (Assistente Graduada de Medicina Geral e Familiar na USF Garcia de Orta, ACeS Porto Ocidental)
Inês Pintalhão (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACes Porto Ocidental)
Rodrigo Costa (Interno de Formação Específica de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Ana Calafate (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Manuel Henriques (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
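
Since the question ultimately wants a CSV file, here is a minimal sketch of writing (name, affiliation) pairs such as the ones yielded above. The helper name `write_authors` is hypothetical, the sample rows are copied from the output above, and an in-memory buffer stands in for the file so the snippet runs without network access.

```python
import csv
import io
from typing import Iterable, Optional, TextIO


def write_authors(rows: Iterable[tuple[str, Optional[str]]], out: TextIO) -> None:
    """Write a header plus one row per author; a missing affiliation becomes an empty field."""
    writer = csv.writer(out)
    writer.writerow(['Autor', 'Afiliacao'])
    for name, affiliation in rows:
        # csv.writer would write None as an empty string anyway; this makes it explicit.
        writer.writerow([name, affiliation or ''])


# Sample pairs taken from the output shown above.
sample = [('Andreia Oliveira', 'Médica'), ('Rita Paraíso', None)]

# newline='' mirrors how a real CSV file should be opened.
buffer = io.StringIO(newline='')
write_authors(sample, buffer)
print(buffer.getvalue())
```

To write a real file, something like `open('data.csv', 'w', newline='', encoding='utf-8')` could be passed as `out`, with `fetch_authors(session=session, article=13494)` as `rows`.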
