Python: Scraping a JavaScript-rendered table with Selenium and Beautiful Soup

Posted on April 7

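I am trying to scrape this website: https://www.globusmedical.com/patient-education-musculoskeletal-system-conditions/resources/find-a-surgeon/

The site appears to use JavaScript, so when I view the page source I can't see the table that lists the doctors. However, when I inspect the element in the browser, it is there with all the information.

I tried clicking the "Load More" button repeatedly until it disappears, and then parsing the page with BeautifulSoup.

Can someone help explain why the printed page_source doesn't show any of the information? Is there a while loop you would set up to keep clicking "Load More" until it disappears?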

from bs4 import BeautifulSoup
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
import time
import requests

doctor_dict = {}

#configure webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options = options)

driver.get("https://www.globusmedical.com/patient-education-musculoskeletal-system-conditions/resources/find-a-surgeon/")
time.sleep(5)
clickable = driver.find_element(By.XPATH,'//button[@class="js-eml-load-more-button eml-load-more-button eml-btn btn btn--primary"]')

driver.execute_script("arguments[0].click();", clickable)
# items = driver.find_element(By.CLASS_NAME,"eml-location grid--item")

soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.prettify())
driver.quit()
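
On the while-loop part of the question, here is a minimal sketch rather than a definitive fix. It keeps clicking until no visible "Load More" button is left, pausing after each click so newly requested rows have a chance to render before page_source is read (the script above reads page_source right after a single click, which may be too early). The helper names click_until_gone and load_all_surgeons, the pause length, and the max_clicks cap are assumptions for illustration; the CSS class is taken from the XPath in the script above.

```python
import time


def click_until_gone(find_button, click, pause=1.0, max_clicks=200):
    """Call click(button) until find_button() returns None.

    Returns the number of clicks performed. max_clicks is a safety cap
    so the loop cannot run forever if the button never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        click(button)
        clicks += 1
        time.sleep(pause)  # give the page time to append the new rows
    return clicks


def load_all_surgeons(driver):
    """Hypothetical wiring of the loop to a Selenium driver."""
    # Imported here so the generic loop above works without Selenium installed
    from selenium.webdriver.common.by import By

    def find_button():
        # find_elements returns an empty list instead of raising when nothing matches
        buttons = driver.find_elements(By.CSS_SELECTOR, "button.js-eml-load-more-button")
        visible = [b for b in buttons if b.is_displayed()]
        return visible[0] if visible else None

    def click(button):
        # Same JavaScript click the original script uses
        driver.execute_script("arguments[0].click();", button)

    return click_until_gone(find_button, click)
```

With a real driver, call load_all_surgeons(driver) after driver.get(...) and only then pass driver.page_source to BeautifulSoup.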

# Query the site's JSON endpoint directly instead of rendering the page in a browser
import requests
from lxml import etree


def get_response_using_headers(page):
    # Browser-like request headers
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en-GB,en;q=0.9',
        'sec-ch-ua': '"Chromium";v="123", "Not:A-Brand";v="8"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
    }
    params = {
        'page': page,
    }
    response = requests.get('https://www.globusmedical.com/wp-json/em-locator/v1/locations/', params=params, headers=headers)
    return response


# api_url = "https://www.globusmedical.com/wp-json/em-locator/v1/locations/?page=1"
doctors_data = []

# you can change the pagination range here
for page in range(1, 2):
    response = get_response_using_headers(page)
    data = response.json()
    for item in data:
        info = {}
        info["name"] = item["post"]["post_title"]
        info["location"] = item["name"]
        # Fall back to the raw address when the formatted one is empty
        address = item["formatted_address"]
        if address == "":
            address = item["address"]
        info["address"] = address
        info["phone"] = item["phone"]
        info["email"] = item["email"]
        # The specialty is only present inside an HTML fragment, so parse it with lxml
        html_str = item["list_item_html"]
        dom = etree.HTML(html_str)
        info["specialty"] = dom.xpath("//div[contains(@class,'procedures')]/text()")[0].strip()
        doctors_data.append(info)

print(doctors_data)

[{'name': 'Aaron Dumont', 'location': 'Tulane University Medical Center', 'address': '1415 Tulane Avenue, Fifth Floor, Neuroscience Center, New Orleans, LA, 70112', 'phone': '(504) 988-5565', 'email': 'adumont2@tulane.edu', 'specialty': 'Adult Deformity, Degenerative Spine, Minimally Invasive Surgery, Robotic Spine Surgery'},
 {'name': 'Aaron Greenberg', 'location': 'Hackensack UMC Pascack Valley', 'address': '784 Franklin Avenue', 'phone': '(844) 777-0910', 'email': 'Ajberg214@gmail.com', 'specialty': 'Adult Deformity, Degenerative Spine, Minimally Invasive Surgery'},
 . . .
 {'name': 'Albert Wong', 'location': 'Cedars Sinai and DOCS Health', 'address': '8436 W 3rd St suite 800, Los Angeles, CA, USA', 'phone': '(310) 746-5918', 'email': 'AW@docshealth.com', 'specialty': 'Adult Deformity, Cervical Artificial Disc, Degenerative Spine, Minimally Invasive Surgery, Robotic Spine Surgery, Sacroiliac Joint Fusion, Spinal Deformity, Surgery of the Neck & Back'}]
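
The range(1, 2) loop fetches only the first page. Below is a rough sketch of walking every page, assuming (the source does not confirm this) that the endpoint returns an empty list once the page number runs past the last page. fetch_all_pages and max_pages are hypothetical names; fetch_page is any callable that returns the decoded JSON list for a given page number.

```python
def fetch_all_pages(fetch_page, max_pages=1000):
    """Collect items from page 1 onward until a page comes back empty.

    ASSUMPTION: an empty result marks the end of the data. max_pages is a
    safety cap in case the endpoint never returns an empty page.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        results.extend(items)
    return results


# Possible use with the function from the answer (requires network access):
# all_items = fetch_all_pages(lambda page: get_response_using_headers(page).json())
```

If the endpoint signals the last page some other way (an error status, or a total count in the response headers), the stop condition would need to change accordingly.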
