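I am trying to write code that scrapes a web page for links and detects which of them are PDF files; the results are collected into a DataFrame. My problem is that the code only detects links that end with the ".pdf" extension. How can I make it detect PDF files behind links that have no extension? My code is: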

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"


response = requests.get(url)


if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")
    results = []
    for link in links:
        href = link.get("href")
        if href is not None:
            file_url = url + href  
            file_response = requests.head(file_url) 
            content_type = file_response.headers.get("Content-Type")
            is_pdf = content_type == 'application/pdf' or href.lower().endswith('.pdf')

            status = file_response.status_code
            if status == 404:  # Check whether the status is 404 (Not Found)
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
            else:
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})

    
    df = pd.DataFrame(results)
    df

else:
    print("Fail", response.status_code) 

I wrote the code and it runs fine, but I would like to improve it.

Recommended answer

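With file_url = url + href you are constructing URLs that do not exist on the server, because each href is appended to the full address of the category page. Try selecting only the links that contain download, and prepend just the domain (base_url) to the URL: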

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
base_url = 'https://machado.mec.gov.br'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.select("a[href*=download]")
    results = []
    for link in links:
        href = link["href"]

        if href.startswith('http'):
            file_url = href
        else:
            file_url = base_url + href

        # HEAD requests only the headers, so the file itself is not downloaded
        file_response = requests.head(file_url)
        content_type = file_response.headers.get("Content-Type", "")
        # The download links have no extension, so check the Content-Type header first;
        # startswith() also accepts values such as "application/pdf; charset=binary"
        is_pdf = content_type.lower().startswith('application/pdf') or href.lower().endswith('.pdf')
        status = file_response.status_code
        results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})

    df = pd.DataFrame(results)
    print(df)

else:
    print("Fail", response.status_code)
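
As a possible alternative to adding base_url by hand, the standard library's urllib.parse.urljoin resolves an href against the page it was found on, and it handles absolute, root-relative and page-relative links alike. The sketch below is only an illustration: resolve_link is a hypothetical helper name, and the example.com and conto.pdf hrefs are made up.

```python
from urllib.parse import urljoin

# Page that was scraped in the question
page_url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"


def resolve_link(page_url: str, href: str) -> str:
    """Turn an href found on page_url into an absolute URL (hypothetical helper)."""
    return urljoin(page_url, href)


# Root-relative href, like the ones in the download links: only the domain is kept
print(resolve_link(page_url, "/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8"))

# Absolute href (made-up example): returned unchanged
print(resolve_link(page_url, "https://example.com/file.pdf"))

# Page-relative href (made-up example): resolved against the page's directory
print(resolve_link(page_url, "conto.pdf"))
```

With this helper, the startswith('http') branch in the loop would no longer be needed, since urljoin covers both cases.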

The answer's script prints:

                                                                                               Link  Status                                                                 Arquivo   PDF
0  https://machado.mec.gov.br/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8     200  /obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8  True
1  https://machado.mec.gov.br/obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f     200  /obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f  True
2  https://machado.mec.gov.br/obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e     200  /obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e  True
3  https://machado.mec.gov.br/obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131     200  /obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131  True
4  https://machado.mec.gov.br/obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a     200  /obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a  True
5  https://machado.mec.gov.br/obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884     200  /obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884  True
6  https://machado.mec.gov.br/obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985     200  /obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985  True
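
The PDF column above is True even though none of the links has an extension, because the decision comes from the Content-Type header. A server may also report a generic type such as application/octet-stream, in which case the file signature can serve as a fallback: PDF files begin with the bytes %PDF-. The following is a minimal sketch of that idea, not part of the original answer; looks_like_pdf is a hypothetical helper, and the caller is assumed to supply the first few bytes of the body.

```python
def looks_like_pdf(content_type, first_bytes=b""):
    """Guess whether a response is a PDF (hypothetical helper).

    content_type: value of the Content-Type header, or None if it is missing.
    first_bytes:  the first few bytes of the response body, if available.
    """
    # Drop parameters such as "; charset=binary" and normalise the case
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fallback: a PDF file starts with the signature "%PDF-"
    return first_bytes.startswith(b"%PDF-")


print(looks_like_pdf("application/pdf"))                         # True
print(looks_like_pdf("Application/PDF; charset=binary"))         # True
print(looks_like_pdf("application/octet-stream", b"%PDF-1.4\n"))  # True
print(looks_like_pdf("text/html", b"<!DOCTYPE html>"))           # False
print(looks_like_pdf(None))                                      # False
```

A HEAD request returns no body, so the fallback needs a GET. With requests, one way to avoid downloading the whole file is likely a streamed request, for example requests.get(file_url, stream=True) followed by next(response.iter_content(chunk_size=5), b""), passing the result as first_bytes.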
