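For some reason, this www.example.com scraper only works properly when I run it against one of the two sites:
- a. https://clutch.co/us/web-developers (US category): it works great
- b. https://clutch.co/il/web-developers (Israel category): it does not work
When I run this code, it only fetches information from the first page and then closes itself. I added waits to allow the page to load, but that did not help. If you watch the browser, you can see it scroll to the bottom of the page and then close.
Well, this runs for me (see below), but only for the US site and not for others, e.g. the Israel site: a. https://clutch.co/us/web-developers runs fine. b. https://clutch.co/il/web-developers stops and returns a whole error.
Hmm, it seems there can sometimes be a problem locating the elements with the class name 'provider-info'. I guess this may be due to changes in the site structure on Clutch.co, or alternatively to some timing issue. I think the script should start handling potential exceptions; this version works for me: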
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

website = "https://clutch.co/us/web-developers"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 10)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        next_url = next_page.get_attribute('href')
        driver.get(next_url)
        time.sleep(6)
        return True
    except Exception:
        # No "next" link (or navigation failed): treat this as the last page
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            # If a lookup fails partway through, the appends above it have
            # already happened, so the lists can end up with different lengths.
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break

driver.close()

data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)
It returned the following:
Timeout Exception occurred while waiting for company elements.
Company_Name ... Rating
0 Hyperlink InfoSystem ... 4.9
1 Plego Technologies ... 5.0
2 Azuro Digital ... 4.9
3 Savas Labs ... 5.0
4 The Gnar Company ... 4.8
5 Sunrise Integration ... 5.0
6 Baytech Consulting ... 5.0
7 Inventive Works ... 4.9
8 Utility ... 4.8
9 Busy Human ... 5.0
10 Rootstrap ... 4.8
11 micro1 ... 4.9
12 ChopDawg.com ... 4.8
13 Emergent Software ... 4.9
14 Beehive Software Inc. ... 5.0
15 3 Media Web ... 4.9
16 Webstacks ... 5.0
17 Mutually Human ... 5.0
18 AnyforSoft ... 4.8
19 NL Softworks ... 5.0
20 OpenSource Technologies Inc. ... 4.8
21 Marcel Digital ... 4.8
22 Twin Sun ... 5.0
23 SPARK Business Works ... 4.9
24 Darwin ... 4.9
25 Perrill ... 5.0
26 Nimi ... 4.9
27 Scopic ... 4.9
28 Interactive Strategies ... 4.9
29 Unleashed Technologies ... 4.9
30 Oyova ... 4.9
31 BrandExtract ... 4.9
32 The Brick Factory ... 4.9
33 My Web Programmer ... 5.0
34 PureLogics LLC ... 4.9
35 Social Driver ... 4.9
36 Calibrate Software ... 4.9
37 VisualFizz ... 5.0
38 Camber Creative ... 4.9
39 Susco Solutions ... 4.9
40 Lunarbyte.io ... 5.0
41 thoughtbot ... 4.9
42 CR Software Solutions ... 5.0
43 Solwey Consulting ... 5.0
44 Ambaum ... 4.9
45 Pacific Codeline LLC ... 5.0
46 PERC ... 5.0
47 Beesoul LLC ... 4.9
48 Novalab Tech ... 5.0
49 Dragon Army ... 5.0
[50 rows x 5 columns]
Process finished with exit code 0
And the following data was stored:
Company_Name,Tagline,Location,Ticket_Price,Rating,Website_Name,URL
Hyperlink InfoSystem,"#1 Mobile App, Web, & Software Development Company","Jersey City, NJ","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Plego Technologies,Shaping the Future of Technology,"Downers Grove, IL","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Azuro Digital,"Award-Winning Web Design, Development & SEO","New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
App Makers USA,Top US Mobile & Web App Development Agency,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
ChopDawg.com,Dreams Delivered Since 2009. Let's Make It App'n!®,"Philadelphia, PA","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Savas Labs,Designing and developing elegant web products.,"Raleigh, NC","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Gnar Company,Solving Gnarly Software Problems. Faster.,"Boston, MA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Sunrise Integration,Enterprise Solutions & Ecommerce Apps,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Baytech Consulting,TRANSLATING YOUR VISION INTO SOFTWARE,"Irvine, CA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Inventive Works,Custom Software Product Development,"Manor, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Utility,AWARD-WINNING MOBILE DESIGN & DEVELOPMENT AGENCY,"New York, NY","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Busy Human,Making life more user-friendly,"Orem, UT","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Rootstrap,Outcome-driven development. At any scale.,"Beverly Hills, CA","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
micro1,"World-class software engineers, powered by AI","Los Angeles, CA","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Emergent Software,Your Full-Stack Technology Partner,"Saint Paul, MN","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
3 Media Web,Award-Winning Digital Experience Agency 🏆🏆🏆,"Marlborough, MA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Beehive Software Inc.,Software reinvented,"Los Gatos, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Webstacks,"The website is a product, not a project.","San Diego, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Mutually Human,Custom Software Development and Design,"Ada, MI","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
AnyforSoft,Amplify digital excellence with AnyforSoft,"Sarasota, FL","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
NL Softworks,Website Design & Development Made to Convert,"Boston, MA","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
OpenSource Technologies Inc.,Web & Mobile APP | Digital Marketing | Cloud,"Lansdale, PA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Twin Sun,Trustworthy partners that deliver results,"Nashville, TN","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Marcel Digital,Changing the Idea of What an Agency Is And Can Be,"Chicago, IL","$5,000+",4.7,Top Web Developers in the United States,https://clutch.co/us/web-developers
Darwin,We create incredible digital experiences,"Reston, VA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
SPARK Business Works,Award-winning custom software dev & web design,"Kalamazoo, MI","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Nimi,"Bring your product ideas to life, to Grow Today.","Oakland, CA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Scopic,"Your Cross-continental, Digital Innovation Partner","Rutland, MA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Interactive Strategies,"Full Service Digital Design, Dev & Marketing","Washington, DC","$100,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Unleashed Technologies,Unleash Your Potential®,"Ellicott City, MD","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Social Driver,Experience digital with us.,"Washington, DC","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Oyova,More Business For Your Business Is Our Business.™,"Jacksonville Beach, FL","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Brick Factory,A DC-based digital agency.,"Washington, DC","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
My Web Programmer,→Top-Quality Custom Software & Web Development Co.,"Atlanta, GA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
PureLogics LLC,No Magic. Just Logic.,"New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
BrandExtract,"We inspire people to create, transform, and grow.","Houston, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Calibrate Software,We craft digital experiences that spark joy 🎉,"Chicago, IL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Camber Creative,Things worth building are worth building well.,"Orlando, FL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
VisualFizz,Impactful Marketing for Industry-Leading Brands,"Chicago, IL","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Susco Solutions,Solve Together | Developing Intuitive Software,"Harvey, LA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Lunarbyte.io,Launching big ideas with startups & enterprises,"Seattle, WA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CR Software Solutions,Innovative Digital Solutions For Your Business,"Canton, MI","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Ambaum,Ambaum is your Shopify Plus Agency,"Burien, WA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Solwey Consulting,Custom software solutions to elevate your business,"Austin, TX","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Pacific Codeline LLC,"Reliable, Experienced, 100% U.S. based.","San Clemente, CA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Novalab Tech,Your Trusted IT Partner,"San Francisco, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Dragon Army,A purpose-driven digital engagement company.,"Atlanta, GA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CodigoDelSur,Rockstar coders for rockstar companies,"Montevideo, Uruguay","$75,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Brainhub,Top 1.36% engineering team - onboarding in 10 days,"Gliwice, Poland","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Curotec,Your digital product engineering department,"Philadelphia, PA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
TekRevol,Creative Web | App | Software Development Company,"Houston, TX","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
XWP,Building a better web at enterprise scale,"New York, NY","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Five Jars,⭐️⭐️⭐️⭐️⭐️ OUTSTANDING WEB DESIGN & DEVELOPMENT,"Brooklyn, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Hmm, but wait: it does not work if we choose another base URL, https://clutch.co/il/web-developers
company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
Traceback (most recent call last):
File "/home/ubuntu/.config/JetBrains/PyCharmCE2023.3/scratches/scratch.py", line 74, in <module>
df = pd.DataFrame(data)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 767, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
Process finished with exit code 1
Hmm, I guess this has to do with some of the exceptions:
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
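Hmm, I think there may be several issues here:
First, some elements were not found while extracting company details: for certain companies, some of the expected elements are missing from the page. This may be due to changes in the site structure or layout. I think we can handle this, so we should include additional error handling or improve our XPath expressions.
On several attempts, a timeout exception also occurred while waiting for the company elements: this indicates that the script timed out while waiting for the elements to load on the page.
Last but not least, I also ran into ValueError: All arrays must be of the same length. This error occurs because the arrays used to construct the DataFrame have different lengths, which typically happens when one or more data points were not collected correctly.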
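On the XPath point above: a predicate such as `@class="rating sg-rating__number"` only matches when the whole attribute string is identical, so an extra or reordered class makes the lookup fail. A common XPath 1.0 idiom is to match a single class token instead. The sketch below only builds the expression strings. The helper names `has_class` and `by_class` are made up for illustration, and whether the Israel pages really use different class strings is an assumption I have not verified.

```python
def has_class(token):
    """XPath 1.0 predicate: true when `token` is one of the element's classes."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {token} ')"


def by_class(tag, *tokens):
    """Relative XPath for <tag> elements carrying every class in `tokens`."""
    predicates = "".join(f"[{has_class(token)}]" for token in tokens)
    return f".//{tag}{predicates}"


# Class names taken from the scraper; they may not match the live site.
LOCALITY_XPATH = by_class("span", "locality")
RATING_XPATH = by_class("span", "sg-rating__number")

print(LOCALITY_XPATH)
print(RATING_XPATH)
```

These strings could then be passed to `company_element.find_element(By.XPATH, LOCALITY_XPATH)` in place of the exact-match expressions.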
Please see the code I used below:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

website = "https://clutch.co/il/it-services"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 20)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        next_url = next_page.get_attribute('href')
        driver.get(next_url)
        time.sleep(6)
        return True
    except Exception:
        # No "next" link (or navigation failed): treat this as the last page
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
websites = []
current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
            # Extracting website URL
            website_element = company_element.find_element(By.XPATH, './/a[@class="website-link"]')
            website_url = website_element.get_attribute('href')
            websites.append(website_url)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break

driver.close()

# Ensure all arrays have the same length.
# Note: truncating avoids the ValueError, but if a field was missing for a
# company in the middle of the run, later values can end up in the wrong row.
min_length = min(len(company_names), len(taglines), len(locations), len(costs), len(ratings), len(websites))
company_names = company_names[:min_length]
taglines = taglines[:min_length]
locations = locations[:min_length]
costs = costs[:min_length]
ratings = ratings[:min_length]
websites = websites[:min_length]

data = {'Company_Name': company_names, 'Tagline': taglines, 'Location': locations, 'Ticket_Price': costs, 'Rating': ratings, 'Website': websites}
df = pd.DataFrame(data)

# Check if DataFrame is empty
if not df.empty:
    df.to_csv('companies_test10.csv', index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")
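Truncating every list to `min_length` makes the DataFrame construction succeed, but it does not guarantee that the values in one row belong to the same company. One possible alternative is to build one complete record per company and fill missing fields with `None`. The sketch below runs without a browser: plain dictionaries stand in for the Selenium elements, `extract_record` is a hypothetical helper, and the sample values are invented. With Selenium, the caught exception would be `NoSuchElementException` rather than `LookupError`.

```python
import csv
import io

FIELDS = ["Company_Name", "Tagline", "Location", "Ticket_Price", "Rating", "Website"]


def extract_record(lookup, fields=FIELDS):
    """Return one complete row; a field that cannot be found becomes None."""
    record = {}
    for name in fields:
        try:
            record[name] = lookup(name)
        except LookupError:  # stand-in for selenium's NoSuchElementException
            record[name] = None
    return record


# Invented stand-in data: the second card has no rating and no website link.
cards = [
    {"Company_Name": "Alpha Inc", "Tagline": "first tagline", "Location": "X",
     "Ticket_Price": "$1,000+", "Rating": "4.9", "Website": "https://alpha.example"},
    {"Company_Name": "Beta Ltd", "Tagline": "second tagline", "Location": "Y",
     "Ticket_Price": "$5,000+"},
]

# Each card yields exactly one row, so rows can never differ in length.
rows = [extract_record(card.__getitem__) for card in cards]

# Write the rows as CSV; None is written as an empty cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A list of dictionaries like `rows` can also be passed straight to `pd.DataFrame(rows)`, which would keep the rest of the script unchanged.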