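I want to collect data from the USDA, specifically from https://mymarketnews.ams.usda.gov/viewReport/2960. I have already collected all of the most recent data, from 14 September 2020 back to July. It seems they changed how they store their data: from July 2020 back to October 2017 it is stored as text files. I want to mine those files, but I can't find a reliable way to do it.
I have tried several approaches. So far my Selenium setup fails and I can't get ChromeDriver to run, which is a pity because it might solve my problem. This is the code I am running: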
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = 'https://mymarketnews.ams.usda.gov/viewReport/2960'


def extract_text_file_links(driver):
    # Modify the selector as necessary
    elements = driver.find_elements(By.CSS_SELECTOR, 'a[href$=".txt"]')
    return [element.get_attribute('href') for element in elements]


def main():
    # Set up the webdriver
    options = webdriver.ChromeOptions()
    options.headless = True  # This runs Chrome in the background
    driver = webdriver.Chrome(executable_path='changed path', options=options)
    try:
        driver.get(URL)
        # Waiting for a specific element to ensure the page has loaded.
        # Adjust the selector and timeout as necessary.
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href$=".txt"]'))
        )
        text_file_links = extract_text_file_links(driver)
        for link in text_file_links:
            print(link)
        # You can then download these files as demonstrated in previous answers.
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
Running it fails with this error:
ValueError: Timeout value connect was <object object at 0x0000020075CC4820>, but it must be an int, float or None.
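One thing that may be worth checking, as an assumption on my part and not a confirmed diagnosis: this message is commonly reported when an older Selenium release is installed alongside urllib3 2.x, which rejects the default timeout object that older Selenium passes to it. A minimal sketch for printing the installed versions, so they can be compared, might look like this (`installed_versions` is a hypothetical helper name):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("selenium", "urllib3")):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the two versions turn out to be mismatched in that way, upgrading Selenium or pinning urllib3 below 2.0 are the remedies usually suggested, though I have not verified either here.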
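Either way, I can't get this scraper to work. The files all sit in a folder on the website, and I don't know how to properly scrape something like that.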
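To make concrete what I mean by scraping a folder: if the folder is served as a plain HTML listing (an assumption, since the report page itself may be rendered by JavaScript), the `.txt` links could be pulled out without a browser at all. The sketch below uses only the standard library; `extract_txt_links`, `fetch_txt_links`, and the example.com URL are hypothetical placeholders, not the real USDA paths.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TxtLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .txt file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".txt"):
            # Resolve relative links against the listing page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_txt_links(html, base_url):
    """Return absolute URLs of all .txt links found in an HTML string."""
    parser = TxtLinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch_txt_links(listing_url):
    """Download a listing page and return its .txt links (needs network)."""
    with urlopen(listing_url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_txt_links(html, listing_url)


if __name__ == "__main__":
    # Offline demonstration on a made-up listing.
    sample = '<a href="report_2020-07-06.txt">a</a> <a href="/about">b</a>'
    print(extract_txt_links(sample, "https://example.com/archive/"))
```

Each returned URL could then be downloaded with `urlopen` in the same way. If the listing is only built in the browser by JavaScript, this approach would return nothing and a browser-driven tool would still be needed.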