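I am trying to scrape this website: https://www.globusmedical.com/patient-education-musculoskeletal-system-conditions/resources/find-a-surgeon/

The site appears to use JavaScript, so when I view the page source I don't see the table that lists the doctors. However, when I inspect the element in the browser, it is there with all the information.

I am trying to click the "Load More" button repeatedly until it disappears, and then parse the page with BeautifulSoup.

Can someone help explain why the printed page_source doesn't show any of the information? Is there a while loop you would set up to keep clicking "Load More" until it disappears?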

from bs4 import BeautifulSoup
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
import time
import requests

doctor_dict = {}

#configure webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options = options)

driver.get("https://www.globusmedical.com/patient-education-musculoskeletal-system-conditions/resources/find-a-surgeon/")
time.sleep(5)
clickable = driver.find_element(By.XPATH,'//button[@class="js-eml-load-more-button eml-load-more-button eml-btn btn btn--primary"]')

driver.execute_script("arguments[0].click();", clickable)
# items = driver.find_element(By.CLASS_NAME,"eml-location grid--item")

soup = BeautifulSoup(driver.page_source, 'html.parser')

print(soup.prettify())
driver.quit()

Recommended answer

There is an API call that returns the same data.

Endpoint: https://www.globusmedical.com/wp-json/em-locator/v1/locations/?page=1

import requests
from lxml import etree

def get_response_using_headers(page):
    headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'accept-language': 'en-GB,en;q=0.9',
            'sec-ch-ua': '"Chromium";v="123", "Not:A-Brand";v="8"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Linux"',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'none',
            'sec-fetch-user': '?1',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
            }

    params = {
            'page': page,
            }

    response = requests.get('https://www.globusmedical.com/wp-json/em-locator/v1/locations/', params=params, headers=headers)
    return response

# api_url = "https://www.globusmedical.com/wp-json/em-locator/v1/locations/?page=1"
doctors_data = []

# you can change the pagination range here
for page in range(1,2):
    response = get_response_using_headers(page)
    data = response.json()
    for item in data:
        info = {}
        info["name"] = item["post"]["post_title"]
        info["location"] = item["name"]
        address = item["formatted_address"]
        if address == "":
            address = item["address"]
        info["address"] = address
        info["phone"] = item["phone"]
        info["email"] = item["email"]
        html_str = item["list_item_html"]
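        dom = etree.HTML(html_str)
        # xpath() returns a list; fall back to "" if no procedures div is present
        specialty = dom.xpath("//div[contains(@class,'procedures')]/text()")
        info["specialty"] = specialty[0].strip() if specialty else ""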

        doctors_data.append(info)

print(doctors_data)

OUTPUT:

[{'name': 'Aaron Dumont', 'location': 'Tulane University Medical Center', 'address': '1415 Tulane Avenue, Fifth Floor, Neuroscience Center, New Orleans, LA, 70112', 'phone': '(504) 988-5565', 'email': 'adumont2@tulane.edu', 'specialty': 'Adult Deformity, Degenerative Spine, Minimally Invasive Surgery, Robotic Spine Surgery'},
{'name': 'Aaron Greenberg', 'location': 'Hackensack UMC Pascack Valley', 'address': '784 Franklin Avenue', 'phone': '(844) 777-0910', 'email': 'Ajberg214@gmail.com', 'specialty': 'Adult Deformity, Degenerative Spine, Minimally Invasive Surgery'},
.
.
.
{'name': 'Albert Wong', 'location': 'Cedars Sinai and DOCS Health', 'address': '8436 W 3rd St suite 800, Los Angeles, CA, USA', 'phone': '(310) 746-5918', 'email': 'AW@docshealth.com', 'specialty': 'Adult Deformity, Cervical Artificial Disc, Degenerative Spine, Minimally Invasive Surgery, Robotic Spine Surgery, Sacroiliac Joint Fusion, Spinal Deformity, Surgery of the Neck & Back'}]
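
The code above only fetches page 1. To collect every page without knowing the page count, one option is to keep requesting pages until one comes back empty. The following is a minimal sketch, not a tested client for this site: `collect_all_pages` and `fetch_from_api` are hypothetical helper names, and it assumes the endpoint returns an empty JSON list once the page number runs past the last page. If the endpoint answers with an error status instead, `raise_for_status()` will raise and the stop condition needs adjusting. The shortened headers are also an assumption; the full header set from the answer may be required.

```python
from typing import Callable


def collect_all_pages(fetch_page: Callable[[int], list], max_pages: int = 1000) -> list:
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # assumption: an empty list marks the end of the results
            break
        items.extend(batch)
    return items


def fetch_from_api(page: int) -> list:
    """Fetch one page of locations from the endpoint used in the answer."""
    # Imported here so collect_all_pages has no third-party dependency.
    import requests

    response = requests.get(
        "https://www.globusmedical.com/wp-json/em-locator/v1/locations/",
        params={"page": page},
        # assumption: a short user-agent is enough; the answer sent a full browser header set
        headers={"user-agent": "Mozilla/5.0"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Calling `collect_all_pages(fetch_from_api)` would then return the raw items for all pages, which can be passed through the same field extraction as in the loop above. The `max_pages` cap is only a safety net so the loop cannot run forever if the endpoint never returns an empty page.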
