My objective: for the website https://data.eastmoney.com/executive/000001.html, when you scroll down you will find a big table for 000001, and I want to turn it into a DataFrame in Python. Is BeautifulSoup enough to do so, or do I have to use Selenium?

Some people on Stack Overflow say that BeautifulSoup cannot scrape table data from the internet, so I tried Selenium. Here is the code:

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://data.eastmoney.com/executive/000001.html')
# Locate the first <table> on the page, then one cell inside it
table_element = driver.find_element_by_xpath("//table")
# Relative XPath (".//") so the search stays inside the table element
item_element = table_element.find_element_by_xpath(".//tr[2]/td[3]")
item_text = item_element.text
df = pd.DataFrame([item_text], columns=["Item"])
print(df)
driver.quit()

Here is the result:

Traceback (most recent call last):
  File "selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver': 'chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in    consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in   msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in  wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 85, in handle_data
    driver = webdriver.Chrome()
  File "selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

Basically it says "the chromedriver executable needs to be in PATH". The problem is that I am using an online backtesting platform called JoinQuant (www.joinquant.com), which makes things complicated for Selenium. Do I have to use Selenium to scrape this kind of data from the internet and turn it into a DataFrame in Python? Or can I use something else, such as BeautifulSoup? With BeautifulSoup, at least the "driver needs to be in PATH" problem does not exist.

For BeautifulSoup, I tried the following:

# Web crawler: send an HTTP request to get the page content
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://data.eastmoney.com/executive/000001.html'
response = requests.get(url)
html_content = response.text

# Check if the request was successful
if response.status_code == 200:
    # Use BeautifulSoup to parse the page and extract the table
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find_all('table')
    # Acquire the rows and columns of the table
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_data = []
        for col in cols:
            row_data.append(col.text.strip())
        data.append(row_data)
else:
    print("Failed to retrieve the webpage.")

# Set up the DataFrame
dataframe = pd.DataFrame(data)
# Print the DataFrame
print(dataframe)

This is the output:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in   consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in  msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in  wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
  File "bs4/element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

But if you change

table = soup.find_all('table')

to

table = soup.find('table')

the result is as follows:

Traceback (most recent call last):
  File "/tmp/jqcore/jqboson/jqboson/core/entry.py", line 379, in _run
    engine.start()
  File "/tmp/jqcore/jqboson/jqboson/core/engine.py", line 231, in start
    self._dispatcher.start()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 280, in start
    self._run_loop()
  File "/tmp/jqcore/jqboson/jqboson/core/dispatcher.py", line 240, in _run_loop
    self._loop.run()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 107, in run
    self._handle_queue()
  File "/tmp/jqcore/jqboson/jqboson/core/loop/loop.py", line 153, in _handle_queue
    message.callback(**message.callback_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_subscriber.py", line 228, in broadcast
    consumer.send(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 59, in   consumer_gen
    msg_callback()
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 52, in msg_callback
    callback(market_data)
  File "/tmp/jqcore/jqboson/jqboson/core/mds/market_data_consumer_manager.py", line 122, in wrapper
    result = callback(*args, **kwargs)
  File "/tmp/jqcore/jqboson/jqboson/core/strategy.py", line 474, in _wrapper
    self._context.current_dt
  File "/tmp/strategy/user_code.py", line 114, in handle_data
    rows = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'
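For reference, both errors can be reproduced on a tiny inline document: find_all() returns a ResultSet (a list of tags), which itself has no find_all() method, while find() returns a single Tag, or None when nothing matches (which is what happens on this page, because the table is not in the static HTML at all). A minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>cell</td></tr></table></body></html>"
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')   # ResultSet: a list of matching tags
table = soup.find('table')        # a single Tag (or None if absent)

print(type(tables).__name__)            # ResultSet - has no .find_all() itself
print(table.find_all('tr')[0].td.text)  # 'cell' - call find_all on a Tag

# When the element is missing entirely, find() returns None:
print(BeautifulSoup("<p>no table</p>", 'html.parser').find('table'))  # None
```

So either index into the ResultSet (`tables[0]`) or use find() and check for None before calling find_all() on the result.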

So to sum up, which one should I use - Selenium or BeautifulSoup? Or even something else? How should I solve this problem?

Recommended answer

There is no need to use selenium or beautifulsoup here; in my opinion, the simplest and most direct way is to use the API from which the data is loaded.

How do you know in such a case whether the content is dynamically loaded / rendered?

First indicator: open the website in a browser as a human and notice that a loading animation / delay appears for that area. Second indicator: the content is not included in the static response to the request. You can then use the browser's developer tools to look at the XHR requests tab and see which data is being loaded from which resources.
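The second indicator can be checked programmatically: pick any value you can see in the rendered table (the marker below is an assumption) and test whether it appears in the raw HTML the server sends back. A minimal sketch:

```python
def rendered_server_side(static_html: str, marker: str) -> bool:
    """True if a value visible in the rendered table already appears in the
    static HTML response, i.e. no JavaScript rendering is involved."""
    return marker in static_html

# The static response for the eastmoney page is only a shell, so a name
# from the table is not found in it (illustrated with a stand-in string):
shell = "<html><body><div id='dataview'></div></body></html>"
print(rendered_server_side(shell, '谢永林'))  # False -> loaded dynamically
```

In practice you would pass `requests.get(url).text` as `static_html`; if the marker is absent, scraping the static page with BeautifulSoup cannot work.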

If there is an API, use it; otherwise, go with selenium.

URL:

https://datacenter-web.eastmoney.com/api/data/v1/get

Parameters:

reportName: RPT_EXECUTIVE_HOLD_DETAILS
columns: ALL
filter: (SECURITY_CODE="000001")
pageNumber: 1
pageSize: 100 #increase this to avoid paging
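Assembled into a query string, these parameters produce the URL used in the example below; a short standard-library sketch shows how the filter value gets percent-encoded (`%3D` for `=`, `%22` for `"`):

```python
from urllib.parse import urlencode

base = 'https://datacenter-web.eastmoney.com/api/data/v1/get'
params = {
    'reportName': 'RPT_EXECUTIVE_HOLD_DETAILS',
    'columns': 'ALL',
    'filter': '(SECURITY_CODE="000001")',
    'pageNumber': 1,
    'pageSize': 100,  # increase this to avoid paging
}
url = base + '?' + urlencode(params)
print(url)
```

With requests you can skip the manual encoding entirely and pass `requests.get(base, params=params)`, which performs the same encoding for you.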
Example:

import requests
import pandas as pd

pd.DataFrame(
    requests.get(
        'https://datacenter-web.eastmoney.com/api/data/v1/get'
        '?reportName=RPT_EXECUTIVE_HOLD_DETAILS&columns=ALL'
        '&filter=(SECURITY_CODE%3D%22000001%22)&pageNumber=1&pageSize=30'
    ).json().get('result').get('data')
)

Output:
SECURITY_CODE DERIVE_SECURITY_CODE SECURITY_NAME CHANGE_DATE PERSON_NAME CHANGE_SHARES AVERAGE_PRICE CHANGE_AMOUNT CHANGE_REASON CHANGE_RATIO CHANGE_AFTER_HOLDNUM HOLD_TYPE DSE_PERSON_NAME POSITION_NAME PERSON_DSE_RELATION ORG_CODE GGEID BEGIN_HOLD_NUM END_HOLD_NUM
0 000001 000001.SZ 平安银行 2021-09-06 00:00:00 谢永林 26700 18.01 480867 竞价交易 0.0001 26700 A股 谢永林 董事 本人 10004085 173000004782302008 nan 26700
1 000001 000001.SZ 平安银行 2021-09-06 00:00:00 项有志 4000 18.46 73840 竞价交易 0.0001 26000 A股 项有志 董事,副行长,首席财务官 本人 10004085 173000004782302010 nan 26000
...
32 000001 000001.SZ 平安银行 2009-08-19 00:00:00 刘巧莉 46200 21.04 972048 竞价交易 0.0015 nan A股 马黎民 监事 配偶 10004085 140000000281406241 nan nan
33 000001 000001.SZ 平安银行 2007-07-09 00:00:00 王魁芝 1600 27.9 44640 二级市场买卖 0.0001 7581 A股 王魁芝 监事 本人 10004085 173000001049726006 5981 7581
