Python 3.x: Optimizing web scraping of satellite images from open data

Posted on June 17

import re
import requests
from bs4 import BeautifulSoup

webpage = 'https://xgis.maaamet.ee/xgis2/page/app/ristipuud'

response = requests.get(webpage)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|tif|png))$', url)
    if not filename:
        print("didn't match with the url: {}".format(url))
        continue
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            url = '{}{}'.format(webpage, url)
        response = requests.get(url)
        f.write(response.content)

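I want to download satellite images from this website (link: https://xgis.maaamet.ee/xgis2/page/app/ristipuud). There are about 6,000 satellite images in TIF format, and I want to get 500 of them for my research. I have to repeat the same process often, so I would like to do it by scraping, but I have run into a problem: when I run this code it shows no errors, yet it downloads no data. The images on the website are split into tiles, which can be downloaded individually by searching for the tile number at https://geoportaal.maaamet.ee/eng/Maps-and-Data/Orthophotos/Download-Orthophotos-p662.html. The RGB orthophotos are provided as compressed files in .tif format. There are several versions of each image depending on the year, and I want the latest one. Unfortunately, my code does not work. Can you help me find the error in my code or share your experience? I am new to programming and trying to learn more.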
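One thing worth checking before anything else is what the `src` values actually look like once they are resolved and tested against the pattern. The sketch below is only a diagnostic aid, not a fix for this particular site: the helper name `resolve_image_urls`, the base URL `https://example.org/app/page`, and the sample `src` values are all made up for illustration. It uses `urllib.parse.urljoin` to resolve relative paths, because plain string formatting such as `'{}{}'.format(webpage, url)` appends the path to the full page URL instead of resolving it against the page. If the real page yields an empty list here, the images are probably not present as `<img>` tags in the static HTML that `requests` receives.

```python
import re
from urllib.parse import urljoin

# Same idea as the pattern in the question: a file name with an image extension at the end.
IMAGE_RE = re.compile(r'/([\w-]+\.(?:jpg|gif|tif|png))$')


def resolve_image_urls(srcs, base_url):
    """Return (absolute_url, file_name) pairs for src values that end in an image name.

    Hypothetical helper: srcs is a list of raw src attribute values.
    """
    found = []
    for src in srcs:
        if not src:
            continue  # skip <img> tags without a usable src
        absolute = urljoin(base_url, src)  # resolves relative and root-relative paths
        match = IMAGE_RE.search(absolute)
        if match:
            found.append((absolute, match.group(1)))
    return found


# Made-up sample values, not taken from the real site.
sample_srcs = ['/img/logo.png', 'tiles/a_1.tif', '', 'map?layer=ortho']
base = 'https://example.org/app/page'
result = resolve_image_urls(sample_srcs, base)
print(result)
```

Printing the list of matches, and separately the values that did not match, usually shows quickly whether the problem is the pattern, the URL joining, or the page itself.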

Recommended answer

import os
import time

import requests
from bs4 import BeautifulSoup


def download_url(url, save_path, chunk_size=128):
    # stream the response to disk in small chunks
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)


def get_file_name(url):
    # the file name is carried in the "f=" query parameter
    tokens = url.split("&")
    for token in tokens:
        if token[:2] == 'f=':
            return token[2:]
    return ''


# Start timer
start_time = time.time()
print("Start time: ", start_time)

# create image directory
image_directory = 'images'
isExist = os.path.exists(image_directory)
if not isExist:
    os.makedirs(image_directory)

# get zip URL and file name
start_sheet = 44744
end_sheet = 44844  # change to 74331 for the full run; only a range of 100 is tested here
total_download = 0

for index in range(start_sheet, end_sheet):
    template = "https://geoportaal.maaamet.ee/index.php?lang_id=2&plugin_act=otsing&page_id=662&&kaardiruut={sheet_number:n}&andmetyyp=ortofoto_eesti_rgb"
    webpage = template.format(sheet_number=index)
    response = requests.get(webpage)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        link = soup.find("a")
        if link is not None:
            url = 'https://geoportaal.maaamet.ee/' + link['href']
            file_name = get_file_name(url)
            print(file_name)
            # save zip file
            download_url(url, './' + image_directory + '/' + get_file_name(url))
            total_download = total_download + 1

# End timer
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time
print("Elapsed time: ", elapsed_time)
print("Total Download zip files: ", total_download)
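The code above takes the first link on each result page, while the question asks for the latest year when several versions exist. A possible way to handle that is sketched below under a loud assumption: that each download link carries the file name in an `f=` query parameter (as `get_file_name` above expects) and that the file name contains a four-digit year. The sample URLs and the helper `pick_latest` are invented for illustration and the real naming scheme may differ. The sketch also reads the `f` parameter with `urllib.parse`, which keeps working when `f` is the first parameter after the `?`.

```python
import re
from urllib.parse import parse_qs, urlparse

# A standalone 4-digit year (1900-2099), not part of a longer number such as a sheet number.
YEAR_RE = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')


def get_file_name(url):
    """Return the value of the 'f' query parameter, or '' if it is absent."""
    values = parse_qs(urlparse(url).query).get('f', [])
    return values[0] if values else ''


def pick_latest(urls):
    """Return the URL whose file name has the largest year, or None for an empty list.

    Assumption: the file name contains a 4-digit year. URLs without one sort last.
    """
    def year_of(url):
        match = YEAR_RE.search(get_file_name(url))
        return int(match.group(0)) if match else -1

    return max(urls, key=year_of) if urls else None


# Made-up candidate links for one sheet; the real file names may look different.
candidates = [
    'https://example.org/index.php?page_id=662&f=44744_OF_RGB_2015.zip',
    'https://example.org/index.php?page_id=662&f=44744_OF_RGB_2021.zip',
    'https://example.org/index.php?f=44744_OF_RGB_2018.zip&page_id=662',
]
latest = pick_latest(candidates)
print(get_file_name(latest))
```

In the loop above, this would mean collecting the `href` of every matching link on the result page (for example with `soup.find_all("a")`) and passing that list to `pick_latest` instead of using the first link.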


