I can download files one at a time like this:

import urllib.request

# Note: urlretrieve() needs a full URL including the scheme; with no
# filename argument it saves each file to a temporary location.
urls = ['http://foo.com/bar.gz', 'http://foobar.com/barfoo.gz', 'http://bar.com/foo.gz']

for u in urls:
    urllib.request.urlretrieve(u)

I could try to parallelize it like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()  # POSIX-only: block until some child process exits
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    # Wait for any remaining child processes to finish
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz', 
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there a way to parallelize urlretrieve without "cheating" with os.system / subprocess?

Given that I have to resort to "cheating" for now, is subprocess.Popen the right way to download the data?

When using parallelized_commandline() above, wget runs multithreaded but not multicore — is that normal? Is there a way to make it multicore rather than multithreaded?

Recommended answer

You can use a thread pool to download the files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
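The same pattern works with the stdlib's concurrent.futures, which makes it easy to cap the worker count and collect results in input order. A minimal sketch; the file:// URLs it creates are stand-ins for the real http:// URLs, so the example runs without a network:

```python
#!/usr/bin/env python3
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Create a few local files to act as "remote" resources.
tmpdir = tempfile.mkdtemp()
urls = []
for name in ('a.gz', 'b.gz', 'c.gz'):
    path = os.path.join(tmpdir, name)
    with open(path, 'wb') as f:
        f.write(name.encode())
    urls.append(Path(path).as_uri())

def download(url):
    # With no filename argument, urlretrieve() picks the location
    # itself and returns (local_path, headers).
    return urlretrieve(url)

# map() preserves input order, so results[i] corresponds to urls[i].
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, urls))
```

Threads are the right fit here because downloading is I/O-bound: the GIL is released while waiting on the network, so extra cores would not help.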

You can also download multiple files at once in a single thread using asyncio:

#!/usr/bin/env python3
import asyncio
import logging

import aiohttp  # $ pip install aiohttp

async def download(url, session, semaphore, chunk_size=1 << 15):
    async with semaphore:  # limit the number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:  # save file in chunks
                async for chunk in response.content.iter_chunked(chunk_size):
                    file.write(chunk)
        logging.info('done %s', filename)
        return filename, (response.status, tuple(response.headers.items()))

async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(download(url, session, semaphore) for url in urls))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))

url2filename() is defined here.
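The answer relies on a url2filename() helper defined elsewhere. A plausible minimal version (an assumption, not necessarily the original definition) simply takes the basename of the URL's path component:

```python
from os.path import basename
from urllib.parse import urlsplit

def url2filename(url):
    # Assumed helper: extract the last path segment of the URL,
    # ignoring any query string or fragment.
    # 'http://host/dir/news-commentary-v10.en.gz' -> 'news-commentary-v10.en.gz'
    return basename(urlsplit(url).path)
```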
