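I'm trying to use Python and improve my scraping skills.

The following code is supposed to extract author and affiliation information from a scientific article: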

import csv

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    target = soup.find_all("span")
    
    with open("data.csv", 'w', newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Autor", "Afiliacao"])
        for tar in target:
            name = tar.find("span", attrs={'name': True})
            print(name)
            afiliacao = tar.find(class_='affiliation')
            
            writer.writerow([name, afiliacao])


main("https://rpmgf.pt/ojs/index.php/rpmgf/article/view/13494")

However, all I get as a result is None.

If, on the other hand, I do something like this,

texto_soup = BeautifulSoup(texto_artigo.content, "lxml")
autores = texto_soup.find_all('span', attrs={'class': 'name'})
afiliacao = texto_soup.find_all('span', attrs={'class': 'affiliation'})
for a, b in zip(autores, afiliacao):
    print(a.text + b.text)

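it works (I still need to strip the extra whitespace and newlines from the results). But I don't understand why the first example doesn't work.

Bonus question: how can I avoid losing data in the second example, where zip stops at the shorter of the two lists? I know about zip_longest, but with it I would get an error whenever an element is missing, because the fill values have no .text attribute.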

Recommended answer

Filter down to the authors section, then select by class_:

import bs4
from requests import Session


STRAINER = bs4.SoupStrainer(name='section', class_='item authors')


def fetch_authors(session: Session, article: int) -> bs4.ResultSet:
    with session.get(
        url=f'https://rpmgf.pt/ojs/index.php/rpmgf/article/view/{article}',
    ) as resp:
        resp.raise_for_status()
        dom = bs4.BeautifulSoup(markup=resp.text, parse_only=STRAINER, features='lxml')
    return dom.find_all(name='span', class_='name')


def main() -> None:
    with Session() as session:
        for author_tag in fetch_authors(session=session, article=13494):
            print(author_tag.text.strip())


if __name__ == '__main__':
    main()

Output:

Maria João Gonçalves
Clara Fonseca
Inês Pintalhão
Rodrigo Costa
Ana Calafate
Manuel Henriques
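
The class-based lookup is also why the first snippet in the question only produces None. There, every span is searched for a nested span that carries an HTML attribute literally called name, and the author spans contain no such child. A minimal offline sketch of the difference follows. The HTML is a hypothetical, simplified stand-in for the journal page, and html.parser is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the article page's author markup.
HTML = """
<section class="item authors">
  <ul>
    <li><span class="name">Author One</span>
        <span class="affiliation">Clinic A</span></li>
    <li><span class="name">Author Two</span></li>
  </ul>
</section>
"""

soup = BeautifulSoup(HTML, "html.parser")

# What the question's first snippet does: inside every span, look for a
# *nested* span that has an HTML attribute named "name".
nested = [span.find("span", attrs={"name": True}) for span in soup.find_all("span")]

# What was probably intended: spans whose CSS class is "name".
by_class = [span.get_text(strip=True) for span in soup.find_all("span", class_="name")]

print(nested)
print(by_class)
```

The same reasoning applies to tar.find(class_='affiliation') in the question: the affiliation span sits beside the name span rather than inside it, so a search within the name span cannot reach it.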

If you also care about affiliations, pair each name with the affiliation span that follows it:

from typing import Iterator

import bs4
from requests import Session


STRAINER = bs4.SoupStrainer(name='section', class_='item authors')


def fetch_authors(session: Session, article: int) -> Iterator[tuple[str, str | None]]:
    with session.get(
        url=f'https://rpmgf.pt/ojs/index.php/rpmgf/article/view/{article}',
    ) as resp:
        resp.raise_for_status()
        dom = bs4.BeautifulSoup(markup=resp.text, parse_only=STRAINER, features='lxml')

    for name in dom.find_all(name='span', class_='name'):
        name_str = name.text.strip()
        # Search through the following span siblings for a matching affiliation tag
        for sibling in name.find_next_siblings(name='span'):
            # A span without a class attribute yields an empty list instead of raising
            classes = sibling.get('class') or []
            if 'affiliation' in classes:
                # The nearest relevant span sibling is an affiliation, so use it
                yield name_str, sibling.text.strip()
                break
            if 'name' in classes:
                # We reached the next author's name first, so there is no affiliation
                yield name_str, None
                break
        else:
            # No name or affiliation span follows, so there is no affiliation
            yield name_str, None


def main() -> None:
    with Session() as session:
        print('An article with some authors missing affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13545):
            print(f'{name} ({affiliation})')
        print()

        print('An article with authors all having affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13494):
            print(f'{name} ({affiliation})')
        print()


if __name__ == '__main__':
    main()

Output:

An article with some authors missing affiliation:
Andreia Oliveira (Médica)
Rita Paraíso (None)
Paola Lobão (None)
Vanessa Guerreiro (None)

An article with authors all having affiliation:
Maria João Gonçalves (USF Garcia de Orta)
Clara Fonseca (Assistente Graduada de Medicina Geral e Familiar na USF Garcia de Orta, ACeS Porto Ocidental)
Inês Pintalhão (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACes Porto Ocidental)
Rodrigo Costa (Interno de Formação Específica de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Ana Calafate (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Manuel Henriques (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
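
The question originally wrote its results to data.csv with the header Autor, Afiliacao. A minimal sketch of that step, assuming rows shaped like the (name, affiliation) pairs printed above, could look like this. The helper name write_authors is hypothetical, and an in-memory buffer stands in for the real file so the sketch runs without touching disk or the network.

```python
import csv
import io
from typing import Iterable, Optional


def write_authors(rows: Iterable[tuple[str, Optional[str]]], out) -> None:
    """Write (name, affiliation) pairs as CSV; a missing affiliation becomes an empty cell."""
    writer = csv.writer(out)
    writer.writerow(['Autor', 'Afiliacao'])
    for name, affiliation in rows:
        writer.writerow([name, affiliation or ''])


# Two sample rows copied from the output above; in practice an iterable of
# scraped (name, affiliation) pairs would be passed instead.
sample = [('Andreia Oliveira', 'Médica'), ('Rita Paraíso', None)]

# Stands in for: open('data.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
write_authors(sample, buffer)
print(buffer.getvalue())
```

Because each name is paired with its own affiliation before anything is written, an author without an affiliation gets an empty cell instead of shifting the following rows, which is the misalignment that zip and zip_longest over two separate lists cannot avoid.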
