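For some reason, this Clutch.co scraper only works properly on some of the sites I run it against:

  a. https://clutch.co/us/web-developers (the US category): it works great
  b. https://clutch.co/il/web-developers (the Israel category): it does not work

When I run this code, it only gets information from the first page and then closes itself. I added waits to allow the page to load, but that did not help. If you watch the browser, you can see it scroll to the bottom of the page and then close.

OK, this runs for me (see below), but only for the US site and not for others such as the Israel site: a. https://clutch.co/us/web-developers runs fine. b. https://clutch.co/il/web-developers stops and gives back a whole error.

Hmm, it seems there may sometimes be a problem locating the elements with the class name 'provider-info'. I guess this could be due to changes in the structure of the Clutch.co site, or alternatively to some timing issue. I think I should start handling potential exceptions; this version worked for me: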

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time

website = "https://clutch.co/us/web-developers"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)

driver = webdriver.Chrome(options=options)
driver.get(website)

wait = WebDriverWait(driver, 10)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except:
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []

current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break

    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)

            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)

            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)

            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)

            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue

    current_page += 1

    if not navigate_to_next_page():
        break

driver.close()

data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)

It returned the following:

  import pandas as pd
Timeout Exception occurred while waiting for company elements.
                    Company_Name  ... Rating
0           Hyperlink InfoSystem  ...    4.9
1             Plego Technologies  ...    5.0
2                  Azuro Digital  ...    4.9
3                     Savas Labs  ...    5.0
4               The Gnar Company  ...    4.8
5            Sunrise Integration  ...    5.0
6             Baytech Consulting  ...    5.0
7                Inventive Works  ...    4.9
8                        Utility  ...    4.8
9                     Busy Human  ...    5.0
10                     Rootstrap  ...    4.8
11                        micro1  ...    4.9
12                  ChopDawg.com  ...    4.8
13             Emergent Software  ...    4.9
14         Beehive Software Inc.  ...    5.0
15                   3 Media Web  ...    4.9
16                     Webstacks  ...    5.0
17                Mutually Human  ...    5.0
18                    AnyforSoft  ...    4.8
19                  NL Softworks  ...    5.0
20  OpenSource Technologies Inc.  ...    4.8
21                Marcel Digital  ...    4.8
22                      Twin Sun  ...    5.0
23          SPARK Business Works  ...    4.9
24                        Darwin  ...    4.9
25                       Perrill  ...    5.0
26                          Nimi  ...    4.9
27                        Scopic  ...    4.9
28        Interactive Strategies  ...    4.9
29        Unleashed Technologies  ...    4.9
30                         Oyova  ...    4.9
31                  BrandExtract  ...    4.9
32             The Brick Factory  ...    4.9
33             My Web Programmer  ...    5.0
34                PureLogics LLC  ...    4.9
35                 Social Driver  ...    4.9
36            Calibrate Software  ...    4.9
37                    VisualFizz  ...    5.0
38               Camber Creative  ...    4.9
39               Susco Solutions  ...    4.9
40                  Lunarbyte.io  ...    5.0
41                    thoughtbot  ...    4.9
42         CR Software Solutions  ...    5.0
43             Solwey Consulting  ...    5.0
44                        Ambaum  ...    4.9
45          Pacific Codeline LLC  ...    5.0
46                          PERC  ...    5.0
47                   Beesoul LLC  ...    4.9
48                  Novalab Tech  ...    5.0
49                   Dragon Army  ...    5.0

[50 rows x 5 columns]

Process finished with exit code 0

and it stored the following data:

Company_Name,Tagline,Location,Ticket_Price,Rating,Website_Name,URL
Hyperlink InfoSystem,"#1 Mobile App, Web, & Software Development Company","Jersey City, NJ","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Plego Technologies,Shaping the Future of Technology,"Downers Grove, IL","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Azuro Digital,"Award-Winning Web Design, Development & SEO","New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
App Makers USA,Top US Mobile & Web App Development Agency,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
ChopDawg.com,Dreams Delivered Since 2009. Let's Make It App'n!®,"Philadelphia, PA","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Savas Labs,Designing and developing elegant web products.,"Raleigh, NC","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Gnar Company,Solving Gnarly Software Problems. Faster.,"Boston, MA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Sunrise Integration,Enterprise Solutions & Ecommerce Apps,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Baytech Consulting,TRANSLATING YOUR VISION INTO SOFTWARE,"Irvine, CA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Inventive Works,Custom Software Product Development,"Manor, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Utility,AWARD-WINNING MOBILE DESIGN & DEVELOPMENT AGENCY,"New York, NY","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Busy Human,Making life more user-friendly,"Orem, UT","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Rootstrap,Outcome-driven development. At any scale.,"Beverly Hills, CA","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
micro1,"World-class software engineers, powered by AI","Los Angeles, CA","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Emergent Software,Your Full-Stack Technology Partner,"Saint Paul, MN","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
3 Media Web,Award-Winning Digital Experience Agency 🏆🏆🏆,"Marlborough, MA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Beehive Software Inc.,Software reinvented,"Los Gatos, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Webstacks,"The website is a product, not a project.","San Diego, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Mutually Human,Custom Software Development and Design,"Ada, MI","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
AnyforSoft,Amplify digital excellence with AnyforSoft,"Sarasota, FL","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
NL Softworks,Website Design & Development Made to Convert,"Boston, MA","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
OpenSource Technologies Inc.,Web & Mobile APP | Digital Marketing | Cloud,"Lansdale, PA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Twin Sun,Trustworthy partners that deliver results,"Nashville, TN","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Marcel Digital,Changing the Idea of What an Agency Is And Can Be,"Chicago, IL","$5,000+",4.7,Top Web Developers in the United States,https://clutch.co/us/web-developers
Darwin,We create incredible digital experiences,"Reston, VA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
SPARK Business Works,Award-winning custom software dev & web design,"Kalamazoo, MI","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Nimi,"Bring your product ideas to life, to Grow Today.","Oakland, CA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Scopic,"Your Cross-continental, Digital Innovation Partner","Rutland, MA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Interactive Strategies,"Full Service Digital Design, Dev & Marketing","Washington, DC","$100,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Unleashed Technologies,Unleash Your Potential®,"Ellicott City, MD","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Social Driver,Experience digital with us.,"Washington, DC","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Oyova,More Business For Your Business Is Our Business.™,"Jacksonville Beach, FL","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Brick Factory,A DC-based digital agency.,"Washington, DC","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
My Web Programmer,→Top-Quality Custom Software & Web Development Co.,"Atlanta, GA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
PureLogics LLC,No Magic. Just Logic.,"New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
BrandExtract,"We inspire people to create, transform, and grow.","Houston, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Calibrate Software,We craft digital experiences that spark joy 🎉,"Chicago, IL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Camber Creative,Things worth building are worth building well.,"Orlando, FL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
VisualFizz,Impactful Marketing for Industry-Leading Brands,"Chicago, IL","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Susco Solutions,Solve Together | Developing Intuitive Software,"Harvey, LA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Lunarbyte.io,Launching big ideas with startups & enterprises,"Seattle, WA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CR Software Solutions,Innovative Digital Solutions For Your Business,"Canton, MI","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Ambaum,Ambaum is your Shopify Plus Agency,"Burien, WA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Solwey Consulting,Custom software solutions to elevate your business,"Austin, TX","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Pacific Codeline LLC,"Reliable, Experienced, 100% U.S. based.","San Clemente, CA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Novalab Tech,Your Trusted IT Partner,"San Francisco, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Dragon Army,A purpose-driven digital engagement company.,"Atlanta, GA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CodigoDelSur,Rockstar coders for rockstar companies,"Montevideo, Uruguay","$75,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Brainhub,Top 1.36% engineering team - onboarding in 10 days,"Gliwice, Poland","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Curotec,Your digital product engineering department,"Philadelphia, PA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
TekRevol,Creative Web | App | Software Development Company,"Houston, TX","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
XWP,Building a better web at enterprise scale,"New York, NY","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Five Jars,⭐️⭐️⭐️⭐️⭐️ OUTSTANDING WEB DESIGN & DEVELOPMENT,"Brooklyn, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers

Hmm, but wait: it does not work here if we choose another base URL, https://clutch.co/il/web-developers:

company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
Traceback (most recent call last):
  File "/home/ubuntu/.config/JetBrains/PyCharmCE2023.3/scratches/scratch.py", line 74, in <module>
    df = pd.DataFrame(data)
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 767, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Process finished with exit code 1

Hmm, I guess this has to do with some exceptions:

  import pandas as pd
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.

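Hmm, I think there may be a whole set of issues:

First, some elements were not found while extracting company details: for some companies, the script could not locate one or more elements. This could be due to changes in the site's structure or layout. I think we can handle this by adding extra error handling or improving the XPath expressions.

Second, across several attempts, a timeout exception also occurred while waiting for the company elements: the script timed out while waiting for the elements to load on the page.

Last but not least, I also ran into ValueError: All arrays must be of the same length. This error occurs because the arrays used to construct the DataFrame have different lengths, which usually happens when one or more data points were not collected correctly.

See the code I used below: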

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time

website = "https://clutch.co/il/it-services"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)

driver = webdriver.Chrome(options=options)
driver.get(website)

wait = WebDriverWait(driver, 20)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except:
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
websites = []

current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break

    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)

            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)

            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)

            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)

            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)

            # Extracting website URL
            website_element = company_element.find_element(By.XPATH, './/a[@class="website-link"]')
            website_url = website_element.get_attribute('href')
            websites.append(website_url)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue

    current_page += 1

    if not navigate_to_next_page():
        break

driver.close()

# Ensure all arrays have the same length
min_length = min(len(company_names), len(taglines), len(locations), len(costs), len(ratings), len(websites))
company_names = company_names[:min_length]
taglines = taglines[:min_length]
locations = locations[:min_length]
costs = costs[:min_length]
ratings = ratings[:min_length]
websites = websites[:min_length]

data = {'Company_Name': company_names, 'Tagline': taglines, 'Location': locations, 'Ticket_Price': costs, 'Rating': ratings, 'Website': websites}
df = pd.DataFrame(data)

# Check if DataFrame is empty
if not df.empty:
    df.to_csv('companies_test10.csv', index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")

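Recommended answer

Unfortunately, I don't think scraping is a practical solution here. Use an API.

Let's go through your problems one by one:

Element not found while extracting company details

This one is easy to fix. It means an element could not be found on the page, so you can append a placeholder in its place to the lists you use to collect the data: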

[...]
except NoSuchElementException:
    print("Element not found while extracting company details.")
    company_names.append("")
    taglines.append("")
    ratings.append("")
    locations.append("")
    costs.append("")
    continue
[...]
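
One caveat with a blanket append in the except branch: if some fields were already appended before the failing lookup, the lists can still end up with different lengths. A minimal sketch of one way around this is to resolve each field separately and collect one dict per company. The helper names text_or_default and extract_row are hypothetical, and the dict-based "cards" below only stand in for page elements so that the example runs without a browser.

```python
def text_or_default(getter, missing=(LookupError,), default=""):
    """Return getter(), or `default` if it raises one of the `missing` exceptions."""
    try:
        return getter()
    except missing:
        return default


def extract_row(fields, missing=(LookupError,)):
    """Build one row; `fields` maps a column name to a zero-argument callable."""
    return {name: text_or_default(getter, missing) for name, getter in fields.items()}


# Stand-ins for page elements: a missing key raises KeyError (a LookupError),
# playing the role that NoSuchElementException plays in Selenium.
cards = [{"name": "Acme", "rating": "4.9"}, {"name": "Beta"}]

rows = [
    extract_row({
        "Company_Name": lambda c=card: c["name"],
        "Rating": lambda c=card: c["rating"],
    })
    for card in cards
]
print(rows)  # every row has the same keys, so the columns stay aligned
```

With Selenium, the callables would be lookups such as lambda: company_element.find_element(...).text, with missing=(NoSuchElementException,), and the list of rows can be passed straight to pd.DataFrame(rows).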

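Exception while waiting for company elements

This is your main problem. clutch.co uses Cloudflare, and after you make several requests it starts throttling them and redirecting them to a captcha page. One of the reasons they use this approach is precisely to prevent automated bots from collecting their data. You can read more about it here.

So when this happens you get a TimeoutException: because the page takes too long, Selenium assumes the data will not load and raises this exception. You can increase the timeout, but that is not practical and will not hold up for long.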
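
To tell a blocked request apart from a page that really has no matching elements, it can help to inspect the page when the timeout fires. The following is only a rough sketch: the function name looks_like_challenge is hypothetical, and the marker strings are assumptions that should be checked against what the browser actually shows when a request is blocked.

```python
# Marker strings are assumptions for illustration, not a definitive list.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "captcha")


def looks_like_challenge(title, page_source):
    """Return True if the title or HTML contains any of the marker strings."""
    haystack = f"{title}\n{page_source}".lower()
    return any(marker in haystack for marker in CHALLENGE_MARKERS)


print(looks_like_challenge("Just a moment...", "<html></html>"))
print(looks_like_challenge("Top Web Developers in Israel",
                           "<div class='provider-info'></div>"))
```

In the scraper, this could be called as looks_like_challenge(driver.title, driver.page_source) inside the except TimeoutException branch, so the log says whether the run was stopped by a challenge page rather than by a missing element.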

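First, you would need to solve a captcha for every page, which is time-consuming. You can pay a service to solve them for you, but that will cost you money.

Moreover, and most importantly, if you keep making automated requests through Cloudflare, they may at some point add your IP to a blacklist, in which case you will have to start using a proxy service. That will also cost you money.

If you really want to go down this road, try Cloudscraper.

ValueError: All arrays must be of the same length

This is a consequence of the previous problems. Pandas expects all the lists that hold the data (company_names, taglines, locations, costs, ratings) to have the same length, because they become the columns of a single DataFrame. When their lengths differ, the error is raised.

So something like this will not work...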

df = pd.DataFrame({"a": [1, 2], "b": [1]})  # will raise ValueError

But this will:

df = pd.DataFrame({"a": [1, 2], "b": [1, 3]})

If you can solve the problems above and collect all the data, this error will go away too.
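
Until then, a small check before building the DataFrame can show which column came up short. This is a minimal sketch with hypothetical helper names (column_lengths, assert_same_length). Note that trimming every list to the shortest length, as in the question, hides the error but can pair values from different companies in the same row.

```python
def column_lengths(data):
    """Map each column name to the number of collected values."""
    return {name: len(values) for name, values in data.items()}


def assert_same_length(data):
    """Raise a ValueError that names the per-column lengths if they differ."""
    lengths = column_lengths(data)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


data = {"Company_Name": ["A", "B"], "Rating": ["4.9"]}
try:
    assert_same_length(data)
except ValueError as exc:
    print(exc)  # Column lengths differ: {'Company_Name': 2, 'Rating': 1}
```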

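Use an API

If an API provides all the data you need, I recommend using it, even if it is a paid one, rather than trying to scrape the data. It will be far less error-prone and will require less development time. In the end, you may well save money.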
