Python: scrolling to get all links

Posted on October 4

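The page lists 3,821 links, but my script only returns 103 of them. I also tried scrolling the page with window.scrollTo before collecting the links, but that made no difference.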

    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    import ssl
    import undetected_chromedriver as uc
    import requests
    from bs4 import BeautifulSoup
    import re
    
    # Disable HTTPS certificate verification for the default SSL context
    ssl._create_default_https_context = ssl._create_unverified_context
    
    # ... Rest of your code ...
    
    options = uc.ChromeOptions()
    driver = uc.Chrome(options=options)
    driver.get("http://www.servicealberta.gov.ab.ca/find-if-business-is-licenced.cfm")
    
    # Click the button to load initial content
    click_on_button = driver.find_element(By.CSS_SELECTOR, "td:nth-child(1) input:nth-child(1)")
    click_on_button.click()
    
    # Give the results page time to load
    time.sleep(2)
    
    base_url = "http://www.servicealberta.gov.ab.ca/"
    
    # Scroll to the bottom of the page before reading its source
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    data = BeautifulSoup(driver.page_source, "html.parser")
    links = data.select("tbody td[colspan='4'] a")
    for link in links:
        url = base_url + link['href']
        print(url)
    # Count the links found (len(url) would be the length of the last URL string)
    print(len(links))

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    api_url = "http://www.servicealberta.gov.ab.ca/find-if-business-is-licenced.cfm"
    payload = {"faction": "SearchResults", "BusName": "", "BusType": "all", "BusMunc": ""}
    
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, "html.parser")
    
    all_data = []
    for row in soup.select('td[colspan="4"]:has(a)'):
        link = row.a["href"]
        name = row.a.get_text(strip=True, separator=" ")
        type_ = row.find_next("td").get_text(strip=True)
        area = row.find_next("td").find_next("td").get_text(strip=True)
        all_data.append({"name": name, "type": type_, "area": area, "link": link})
    
    df = pd.DataFrame(all_data)
    print(df.tail())

                                                                     name                type      area                                                                 link
    3816                   ZOUPPAS BARRY SCOTT d.b.a ARDCO CONSTRUCTION  Prepaid Contractor   AIRDRIE  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=25581
    3817  ZSA LEGAL RECRUITMENT LIMITED d.b.a ZSA LEGAL RECRUITMENT LIMITED   Employment Agency   CALGARY   /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=9367
    3818                              ZU HOUSE LTD. d.b.a ZU HOUSE LTD.  Prepaid Contractor  EDMONTON  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=43181
    3819      ZYIA ACTIVE CANADA LIMITED d.b.a ZYIA ACTIVE CANADA LIMITED       Direct Seller  EDMONTON  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=44542
    3820                ZZ CONSTRUCTION LTD. d.b.a ZZ CONSTRUCTION LTD.  Prepaid Contractor   CALGARY  /find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=33207
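
The `link` values are site-relative paths that start with a slash, so simply prepending a base URL that ends in a slash would produce a double slash. One possible way to turn them into absolute URLs is `urllib.parse.urljoin` from the standard library. This is only a sketch: the helper name `to_absolute` is made up for illustration, and the two sample paths are copied from the output shown here.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.servicealberta.gov.ab.ca/"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each relative href against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]


# Sample relative links taken from the printed DataFrame rows
hrefs = [
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=25581",
    "/find-if-business-is-licenced.cfm?faction=SearchDetails&BusID=9367",
]

absolute = to_absolute(hrefs)
for url in absolute:
    print(url)
```

With a DataFrame like the one in the answer, the same idea could be applied to the whole column, for example `df["link"] = df["link"].map(lambda h: urljoin(BASE_URL, h))`.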
