The page has 3,821 links, but my code only reports 103. I also tried scrolling the page with window.scrollTo, but it made no difference.

    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    import ssl
    import undetected_chromedriver as uc
    import requests
    from bs4 import BeautifulSoup
    import re
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    # ... Rest of your code ...
    
    options = uc.ChromeOptions()
    driver = uc.Chrome(options=options)
    driver.get("http://www.servicealberta.gov.ab.ca/find-if-business-is-licenced.cfm")
    
    # Click the button to load initial content
    click_on_button = driver.find_element(By.CSS_SELECTOR, "td:nth-child(1) input:nth-child(1)")
    click_on_button.click()
    
    
    time.sleep(2)
    
    base_url = "http://www.servicealberta.gov.ab.ca/"
    
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        
    data = BeautifulSoup(driver.page_source, "html.parser")
    links = data.select("tbody td[colspan='4'] a")
    for link in links:
        url = base_url + link['href']
        print(url)
    print(len(links))  # count the links, not the characters in the last URL

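Recommended answer

You do not need Selenium for this page: the search form can be submitted directly as a POST request, and the response contains the full results table. To collect every link on the page into a pandas DataFrame, you can use the following example: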

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # The search form posts back to the same URL; with the filters left empty
    # (and BusType set to "all") the response lists every business.
    api_url = "http://www.servicealberta.gov.ab.ca/find-if-business-is-licenced.cfm"
    payload = {"faction": "SearchResults", "BusName": "", "BusType": "all", "BusMunc": ""}

    soup = BeautifulSoup(requests.post(api_url, data=payload).content, "html.parser")

    all_data = []
    # Each matching cell spans four columns and contains the link to the details page.
    for row in soup.select('td[colspan="4"]:has(a)'):
        link = row.a["href"]
        name = row.a.get_text(strip=True, separator=" ")
        # The two cells that follow are stored as "type" and "area".
        type_ = row.find_next("td").get_text(strip=True)
        area = row.find_next("td").find_next("td").get_text(strip=True)

        all_data.append({"name": name, "type": type_, "area": area, "link": link})

    df = pd.DataFrame(all_data)
    print(df.tail())

Prints:

                                                                       name                type      area                                                                 link
    3816                       ZOUPPAS BARRY SCOTT d.b.a ARDCO CONSTRUCTION  Prepaid Contractor   AIRDRIE  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=25581
    3817  ZSA LEGAL RECRUITMENT LIMITED d.b.a ZSA LEGAL RECRUITMENT LIMITED   Employment Agency   CALGARY   /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=9367
    3818                                  ZU HOUSE LTD. d.b.a ZU HOUSE LTD.  Prepaid Contractor  EDMONTON  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=43181
    3819        ZYIA ACTIVE CANADA LIMITED d.b.a ZYIA ACTIVE CANADA LIMITED       Direct Seller  EDMONTON  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=44542
    3820                    ZZ CONSTRUCTION LTD. d.b.a ZZ CONSTRUCTION LTD.  Prepaid Contractor   CALGARY  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=33207
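
The `link` column holds relative paths, and the question asked for full URLs and a count. The sketch below shows one way to get both, using only the standard library. The five hrefs are copied from the output above and the base URL is the one from the question; the rest is illustrative and not part of the original answer. It also prints the length of one concatenated URL, because `len()` on a string counts characters. For the last row that length is 103, the same number reported in the question, so the Selenium code may have found every link and only miscounted them.

```python
from urllib.parse import urljoin

base_url = "http://www.servicealberta.gov.ab.ca/"

# Relative hrefs copied from the last five rows of the output above.
hrefs = [
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=25581",
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=9367",
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=43181",
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=44542",
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=33207",
]

# urljoin resolves each href against the base URL, so the result has no
# doubled slash, unlike plain concatenation (base_url + href).
urls = [urljoin(base_url, href) for href in hrefs]

for url in urls:
    print(url)

# len() of the list is the number of links.
print(len(urls))

# len() of a single string is its character count, not a link count.
print(len(base_url + hrefs[-1]))
```

With the DataFrame from the answer, the same idea would be `df["link"] = df["link"].map(lambda href: urljoin(base_url, href))`, and `len(df)` gives the number of rows.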
