I'm building a fully automated job-application tool. Funnily enough, the automation part was fairly easy; the scraping part, not so much.
In short, requests + beautifulsoup works for most of the domains I'm scraping, but when I try the same process on a Workable page, nothing works:
import requests
from bs4 import BeautifulSoup as bs

session = requests.Session()
url = 'https://apply.workable.com/breederdao-1/j/602097ACC9/'
req = session.get(url)
soup = bs(req.content, 'html.parser')

title = soup.find('h1', {'data-ui': 'job-title'})
print(title)
>>> None

details = soup.find('span', {'data-ui': 'job-location'})
print(details)
>>> None
Both of these elements sit inside body, but when I try to get the page title, I do get what I expect:
title_0 = soup.find('title')
print(title_0)
>>> <title>Data Analyst (Fully Remote) - BreederDAO</title>
I've also tried await + HTMLSession/AsyncHTMLSession, but as long as the element is inside body, every find() still returns None.
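For what it's worth, the pattern I'm seeing (head content present, body content missing) is what you'd get from a page whose body is filled in by JavaScript after load. A minimal sketch with a made-up shell page (the HTML below is synthetic, not Workable's actual markup) reproduces it:

```python
from bs4 import BeautifulSoup as bs

# hypothetical shell page: the server ships <head> plus an empty mount
# point, and client-side JavaScript would inject the job details later
shell = '''<html>
<head><title>Data Analyst (Fully Remote) - BreederDAO</title></head>
<body><div id="app"></div></body>
</html>'''

soup = bs(shell, 'html.parser')
print(soup.find('title'))                         # present in the raw HTML
print(soup.find('h1', {'data-ui': 'job-title'}))  # None: never in the raw HTML
```

If that's what is happening here, then no amount of parsing the raw response will ever find those nodes, because they don't exist until a browser runs the scripts.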
Can someone walk me through this? My current assumption is that the site has some kind of anti-scraping mechanism, but I don't even know where to start looking. The element does look peculiar, though:
<html...
<head>...</head>
<body>
...
<noscript>
<iframe height="0" width="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WKS7WTT&gtm_auth=SGnzIn3pcB7S4fevFXOKPQ&gtm_preview=env-2&gtm_cookies_win=x" style="display: none; visibility: hidden;">
#document
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>ns</title>
</head>
<body>
" "
</body>
</html>
</iframe>
</noscript>
...
</body>
</html>