My objective: for the website https://data.eastmoney.com/executive/000001.html, when you scroll down you will find a big table for 000001, and I want to turn it into a DataFrame in Python. Is BeautifulSoup enough to do so, or do I have to use Selenium?

Some people on Stack Overflow say that BeautifulSoup cannot scrape table data from the internet, so I tried Selenium. Here is the code:

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://data.eastmoney.com/executive/000001.html')
# Locate the first <table> on the page, then one cell inside it
table_element = driver.find_element_by_xpath("//table")
# Relative XPath (".//") so the search stays inside the table element
item_element = table_element.find_element_by_xpath(".//tr[2]/td[3]")
item_text = item_element.text
df = pd.DataFrame([item_text], columns=["Item"])
print(df)
driver.quit()

Here is the result:

Traceback (most recent call last):
  File "selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in    consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in   msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in  wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 85, in handle_data
    driver = webdriver.Chrome()
  File "selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Basically it says "the chromedriver executable needs to be in PATH". The problem is that I am using an online backtesting platform called JoinQuant (www.joinquant.com), which makes things complicated for Selenium. Do I have to use Selenium to scrape this kind of data from the internet and turn it into a DataFrame in Python? Or can I use something else, such as BeautifulSoup? With BeautifulSoup, at least the "driver needs to be in PATH" problem does not exist.

For BeautifulSoup, I tried the following:

# Web crawler: send an HTTP request to get the page content
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.eastmoney.com/executive/000001.html'
response = requests.get(url)
html_content = response.text

# Check if the request was successful
if response.status_code == 200:
    # Use BeautifulSoup to parse the page and extract the table
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find_all('table')
    # Acquire the rows and columns of the table
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_data = []
        for col in cols:
            row_data.append(col.text.strip())
        data.append(row_data)
else:
    print("Failed to retrieve the webpage.")

# Set up the DataFrame
dataframe = pd.DataFrame(data)
# Print the DataFrame
print(dataframe)

This is the output:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in   consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in  msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in  wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
  File "bs4/element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

But if you change

table = soup.find_all('table')

to

table = soup.find('table')

the result is as follows:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in   consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
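For reference, both errors can be reproduced on a tiny inline document: find_all() returns a ResultSet (a list of tags), which itself has no find_all() method, while find() returns a single Tag, or None when nothing matches (which is what happens on this page, because the table is not in the static HTML at all). A minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>cell</td></tr></table></body></html>"
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')   # ResultSet: a list of matching tags
table = soup.find('table')        # a single Tag (or None if absent)

print(type(tables).__name__)            # ResultSet - has no .find_all() itself
print(table.find_all('tr')[0].td.text)  # 'cell' - call find_all on a Tag

# When the element is missing entirely, find() returns None:
print(BeautifulSoup("<p>no table</p>", 'html.parser').find('table'))  # None
```

So either index into the ResultSet (`tables[0]`) or use find() and check for None before calling find_all() on the result.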

So to sum up, which one should I use - Selenium or BeautifulSoup? Or even something else? How should I solve this problem?

Recommended answer

There is no need to use selenium or beautifulsoup here; in my opinion, the simplest and most direct way is to use the API from which the data is loaded.

How do you know in such a case whether the content is dynamically loaded / rendered?

First indicator: open the website in a browser as a human and notice that a loading animation / delay appears for that area. Second indicator: the content is not included in the static response to the request. You can then use the browser's developer tools to look at the XHR requests tab and see which data is being loaded from which resources.
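The second indicator can be checked programmatically: pick any value you can see in the rendered table (the marker below is an assumption) and test whether it appears in the raw HTML the server sends back. A minimal sketch:

```python
def rendered_server_side(static_html: str, marker: str) -> bool:
    """True if a value visible in the rendered table already appears in the
    static HTML response, i.e. no JavaScript rendering is involved."""
    return marker in static_html

# The static response for the eastmoney page is only a shell, so a name
# from the table is not found in it (illustrated with a stand-in string):
shell = "<html><body><div id='dataview'></div></body></html>"
print(rendered_server_side(shell, '谢永林'))  # False -> loaded dynamically
```

In practice you would pass `requests.get(url).text` as `static_html`; if the marker is absent, scraping the static page with BeautifulSoup cannot work.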

If there is an API, use it; otherwise, go with selenium.

URL:

https://datacenter-web.eastmoney.com/api/data/v1/get

Parameters:

reportName: RPT_EXECUTIVE_HOLD_DETAILS
columns: ALL
filter: (SECURITY_CODE="000001")
pageNumber: 1
pageSize: 100 #increase this to avoid paging
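Assembled into a query string, these parameters produce the URL used in the example below; a short standard-library sketch shows how the filter value gets percent-encoded (`%3D` for `=`, `%22` for `"`):

```python
from urllib.parse import urlencode

base = 'https://datacenter-web.eastmoney.com/api/data/v1/get'
params = {
    'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS',
    'columns': 'ALL',
    'filter': '(SECURITY_CODE="000001")',
    'pageNumber': 1,
    'pageSize': 100,  # increase this to avoid paging
}
url = base + '?' + urlencode(params)
print(url)
```

With requests you can skip the manual encoding entirely and pass `requests.get(base, params=params)`, which performs the same encoding for you.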
Example:

import requests
import pandas as pd

pd.DataFrame(
    requests.get(
        'https://datacenter-web.eastmoney.com/api/data/v1/get'
        '?reportName=RPT_EXECUTIVE_HOLD_DETAILS&columns=ALL'
        '&filter=(SECURITY_CODE%3D%22000001%22)&pageNumber=1&pageSize=30'
    ).json().get('result').get('data')
)

Output:
SECURITY_CODE DERIVE_SECURITY_CODE SECURITY_NAME CHANGE_DATE PERSON_NAME CHANGE_SHARES AVERAGE_PRICE CHANGE_AMOUNT CHANGE_REASON CHANGE_RATIO CHANGE_AFTER_HOLDNUM HOLD_TYPE DSE_PERSON_NAME POSITION_NAME PERSON_DSE_RELATION ORG_CODE GGEID BEGIN_HOLD_NUM END_HOLD_NUM
0 000001 000001.SZ 平安银行 2021-09-06 00:00:00 谢永林 26700 18.01 480867 竞价交易 0.0001 26700 A股 谢永林 董事 本人 10004085 173000004782302008 nan 26700
1 000001 000001.SZ 平安银行 2021-09-06 00:00:00 项有志 4000 18.46 73840 竞价交易 0.0001 26000 A股 项有志 董事,副行长,首席财务官 本人 10004085 173000004782302010 nan 26000
...
32 000001 000001.SZ 平安银行 2009-08-19 00:00:00 刘巧莉 46200 21.04 972048 竞价交易 0.0015 nan A股 马黎民 监事 配偶 10004085 140000000281406241 nan nan
33 000001 000001.SZ 平安银行 2007-07-09 00:00:00 王魁芝 1600 27.9 44640 二级市场买卖 0.0001 7581 A股 王魁芝 监事 本人 10004085 173000001049726006 5981 7581
