Python BeautifulSoup: iterating over a list of elements and extracting text by class

Posted on February 11

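I'm trying to use Python and improve my scraping skills.

The following code is supposed to extract author and affiliation information from a scientific article: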

import csv

import requests
from bs4 import BeautifulSoup


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    target = soup.find_all("span")
    
    with open("data.csv", 'w', newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Autor", "Afiliacao"])
        for tar in target:
            name = tar.find("span", attrs={'name': True})
            print(name)
            afiliacao = tar.find(class_='affiliation')
            
            writer.writerow([name, afiliacao])


main("https://rpmgf.pt/ojs/index.php/rpmgf/article/view/13494")

However, all I get back is None.

If, on the other hand, I do something like this:

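# texto_artigo is not defined in this snippet; presumably it is the
# requests.get(...) response for the article page.
texto_soup = BeautifulSoup(texto_artigo.content, "lxml")
autores = texto_soup.find_all('span', attrs={'class': 'name'})
afiliacao = texto_soup.find_all('span', attrs={'class': 'affiliation'})
for a, b in zip(autores, afiliacao):
    print(a.text + b.text)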

it works (I still need to strip the extra whitespace and newlines from the results). But I don't understand why the first example doesn't work.
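One likely reason the first example yields None can be sketched against a small hypothetical HTML fragment (the markup below is an assumption modelled on the selectors used above, not copied from the site). `attrs={'name': True}` matches tags that carry an HTML attribute literally called `name`, not tags whose class is `name`, and the search runs inside each individual `span` rather than from the surrounding container.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: name and affiliation spans are siblings inside one <li>.
HTML = """
<ul class="authors">
  <li><span class="name"> Ana </span><span class="affiliation"> USF A </span></li>
  <li><span class="name"> Rui </span></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_span = soup.find("span")

# Searches *inside* the first span for a nested <span> that has a "name" attribute.
by_attribute = first_span.find("span", attrs={"name": True})

# Searches from the document root for a <span> whose class is "name".
by_class = soup.find("span", class_="name")

print(by_attribute)                   # None
print(by_class.get_text(strip=True))  # Ana
```

`get_text(strip=True)` also takes care of the surrounding whitespace and newlines.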

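Bonus question: how do I solve the data loss in the second example? I know about zip_longest, but in that case I would get an error whenever an element is missing, because the fill value (None) has no .text attribute.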
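A small sketch of the zip_longest concern, using a hypothetical fragment in which the first author has no affiliation. The helper `text_or_none` is an illustrative name, not part of any library. It guards against the None fill value, but the result also shows why pairing two flat lists can attach an affiliation to the wrong author.

```python
from itertools import zip_longest

from bs4 import BeautifulSoup

# Hypothetical fragment: the first author has no affiliation.
HTML = """
<ul>
  <li><span class="name"> Ana </span></li>
  <li><span class="name"> Rui </span><span class="affiliation"> USF B </span></li>
</ul>
"""


def text_or_none(tag):
    """Return stripped text, or None when zip_longest supplied its fill value."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(HTML, "html.parser")
names = soup.find_all("span", class_="name")
affiliations = soup.find_all("span", class_="affiliation")

pairs = [(text_or_none(n), text_or_none(a)) for n, a in zip_longest(names, affiliations)]
print(pairs)  # [('Ana', 'USF B'), ('Rui', None)]
```

No error is raised, but "USF B" ends up next to Ana instead of Rui, so looking up the affiliation relative to each name is the safer route.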

from typing import Iterator

import bs4
from requests import Session

STRAINER = bs4.SoupStrainer(name='section', class_='item authors')


def fetch_authors(session: Session, article: int) -> Iterator[tuple[str, str | None]]:
    with session.get(
        url=f'https://rpmgf.pt/ojs/index.php/rpmgf/article/view/{article}',
    ) as resp:
        resp.raise_for_status()
        dom = bs4.BeautifulSoup(markup=resp.text, parse_only=STRAINER)  # , features='lxml')

    for name in dom.find_all(name='span', class_='name'):
        # Search through siblings for a matching affiliation tag
        for affiliation in name.find_next_siblings(name='span'):
            name_str = name.text.strip()
            class_ = affiliation.attrs.get('class', ())[0]
            if class_ == 'affiliation':
                # If we've found an affiliation class on the soonest span sibling, use it
                yield name_str, affiliation.text.strip()
                break
            elif class_ == 'name':
                # If we've encountered the next name, there is no affiliation.
                yield name_str, None
                break
        else:
            # If there are no span siblings, there is no affiliation.
            yield name.text.strip(), None


def main() -> None:
    with Session() as session:
        print('An article with some authors missing affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13545):
            print(f'{name} ({affiliation})')
        print()

        print('An article with authors all having affiliation:')
        for name, affiliation in fetch_authors(session=session, article=13494):
            print(f'{name} ({affiliation})')
        print()


if __name__ == '__main__':
    main()

An article with some authors missing affiliation:
Andreia Oliveira (Médica)
Rita Paraíso (None)
Paola Lobão (None)
Vanessa Guerreiro (None)

An article with authors all having affiliation:
Maria João Gonçalves (USF Garcia de Orta)
Clara Fonseca (Assistente Graduada de Medicina Geral e Familiar na USF Garcia de Orta, ACeS Porto Ocidental)
Inês Pintalhão (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACes Porto Ocidental)
Rodrigo Costa (Interno de Formação Específica de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Ana Calafate (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
Manuel Henriques (Assistente de Medicina Geral e Familiar da USF Garcia de Orta, ACeS Porto Ocidental)
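To get back to the CSV file the question started with, (name, affiliation) pairs like the ones printed above can be written with the csv module. This is a minimal sketch: the helper name `write_authors` is illustrative, and an in-memory buffer stands in for the real file so the example runs without touching the disk.

```python
import csv
import io
from typing import Iterable, Optional, TextIO


def write_authors(rows: Iterable[tuple[str, Optional[str]]], out: TextIO) -> None:
    """Write (name, affiliation) pairs as CSV; a missing affiliation becomes an empty cell."""
    writer = csv.writer(out)
    writer.writerow(["Autor", "Afiliacao"])
    for name, affiliation in rows:
        writer.writerow([name, affiliation or ""])


# Two sample rows taken from the output above.
sample = [("Andreia Oliveira", "Médica"), ("Rita Paraíso", None)]

# Stand-in for open("data.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_authors(sample, buffer)
print(buffer.getvalue())
```

Passing the generator returned by `fetch_authors` in place of `sample` should work the same way, since it yields tuples of the same shape.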
