My goal is to extract all the href links from this page and find the .pdf links. I tried both the requests library and Selenium, but neither of them extracts the link.
How can I solve this? Thanks.
For example: this contains a link to a .pdf file.
This is the requests code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
url="https://www.bain.com/insights/topics/energy-and-natural-resources-report/"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
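For what it's worth, this is the kind of filtering I want to do once the hrefs come back (a minimal sketch that reuses soup and url from the snippet above; treating a ".pdf" suffix as the marker of a PDF link is my own assumption, and urljoin is only there to resolve relative hrefs):

from urllib.parse import urljoin

pdf_links = []
for link in soup.find_all('a'):
    href = link.get('href')
    # Keep only hrefs that end in .pdf, resolved against the page URL
    if href and href.lower().endswith('.pdf'):
        pdf_links.append(urljoin(url, href))
print(pdf_links)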
This is the Selenium code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
driver.implicitly_wait(10)  # note: this only delays element lookups, not page_source
driver.get("https://www.bain.com/insights/topics/energy-and-natural-resource-report/")
# driver.get() returns None; the rendered HTML comes from driver.page_source
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
driver.quit()
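In case it matters, here is the variant I would expect to behave better: an explicit wait for the anchor elements instead of the implicit wait, reading the href attributes through Selenium itself rather than parsing page_source. The wait condition is just my guess at what the page needs (a sketch, not something I have confirmed works here):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)  # Selenium 4.6+ resolves the driver binary itself
driver.get("https://www.bain.com/insights/topics/energy-and-natural-resource-report/")
# Wait until at least one <a> element is present before reading hrefs
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
)
for a in driver.find_elements(By.TAG_NAME, "a"):
    href = a.get_attribute("href")  # returns the absolute URL
    if href and href.lower().endswith(".pdf"):
        print(href)
driver.quit()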