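I'm trying to write code that scrapes the links on a web page, detects which ones are PDF files, and collects that information into a DataFrame. My problem is that the code only detects links that end with the ".pdf" extension. How can I make it detect PDF files behind links that have no extension? My code is: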
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin

url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")
    results = []
    for link in links:
        href = link.get("href")
        if href is not None:
            # Resolve relative hrefs against the page URL instead of concatenating strings
            file_url = urljoin(url, href)
            # Skip mailto:, javascript: and other schemes that requests cannot fetch
            if not file_url.startswith(("http://", "https://")):
                continue
            # HEAD does not follow redirects by default, so enable them explicitly
            file_response = requests.head(file_url, allow_redirects=True)
            # The header may be missing or carry parameters such as "; charset=..."
            content_type = file_response.headers.get("Content-Type", "")
            is_pdf = content_type.lower().startswith("application/pdf") or href.lower().endswith(".pdf")
            status = file_response.status_code
            # Record every link, including those that return 404 (Not Found)
            results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
    df = pd.DataFrame(results)
    print(df)
else:
    print("Fail", response.status_code)
The code runs fine as written, but I'd like to improve it.
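One possible direction, as a rough sketch rather than a definitive answer: when a link has no extension, the server's Content-Type header is the first thing to check, and if that is missing or generic, the first bytes of the body can be compared against the PDF signature `%PDF-`. The names `PDF_MAGIC`, `is_pdf_payload` and `probe_pdf` below are hypothetical helpers, not part of the code above, and the sketch assumes the server answers a normal GET request for each link.

```python
PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def is_pdf_payload(content_type, first_bytes):
    """Decide from a Content-Type header value and the first bytes of the body."""
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Some servers send application/octet-stream or no header at all,
    # so fall back to the file signature.
    return first_bytes.startswith(PDF_MAGIC)


def probe_pdf(url, timeout=10):
    """Hypothetical helper: fetch only the start of a response and test it."""
    # Imported here so is_pdf_payload stays usable without requests installed.
    import requests

    # stream=True delays the body download until it is actually read.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        content_type = resp.headers.get("Content-Type")
        if is_pdf_payload(content_type, b""):
            return True
        # Read a single chunk; an empty body yields b"".
        first_chunk = next(resp.iter_content(chunk_size=1024), b"")
        return is_pdf_payload(content_type, first_chunk)
```

In the loop above, `is_pdf = probe_pdf(file_url)` could stand in for the extension check. This costs one GET per link instead of a HEAD, but only the first chunk of each body is read.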