I can download files one at a time like this:

import urllib.request

# Note: urlretrieve() needs a full URL including the scheme; with no
# filename argument it saves each file to a temporary location.
urls = ['http://foo.com/bar.gz', 'http://foobar.com/barfoo.gz', 'http://bar.com/foo.gz']

for u in urls:
    urllib.request.urlretrieve(u)

I could try to parallelize it like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()  # POSIX-only: block until some child process exits
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    # Wait for any remaining child processes to finish
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz', 
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there a way to parallelize urlretrieve without "cheating" with os.system / subprocess?

Given that I have to resort to "cheating" for now, is subprocess.Popen the right way to download the data?

When using parallelized_commandline() above, wget runs multithreaded but not multicore — is that normal? Is there a way to make it multicore rather than multithreaded?

Recommended answer

You can use a thread pool to download the files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
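The same pattern works with the stdlib's concurrent.futures, which makes it easy to cap the worker count and collect results in input order. A minimal sketch; the file:// URLs it creates are stand-ins for the real http:// URLs, so the example runs without a network:

```python
#!/usr/bin/env python3
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Create a few local files to act as "remote" resources.
tmpdir = tempfile.mkdtemp()
urls = []
for name in ('a.gz', 'b.gz', 'c.gz'):
    path = os.path.join(tmpdir, name)
    with open(path, 'wb') as f:
        f.write(name.encode())
    urls.append(Path(path).as_uri())

def download(url):
    # With no filename argument, urlretrieve() picks the location
    # itself and returns (local_path, headers).
    return urlretrieve(url)

# map() preserves input order, so results[i] corresponds to urls[i].
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, urls))
```

Threads are the right fit here because downloading is I/O-bound: the GIL is released while waiting on the network, so extra cores would not help.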

You can also download multiple files at once in a single thread using asyncio:

#!/usr/bin/env python3
import asyncio
import logging

import aiohttp  # $ pip install aiohttp

async def download(url, session, semaphore, chunk_size=1 << 15):
    async with semaphore:  # limit the number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:  # save file in chunks
                async for chunk in response.content.iter_chunked(chunk_size):
                    file.write(chunk)
        logging.info('done %s', filename)
        return filename, (response.status, tuple(response.headers.items()))

async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(download(url, session, semaphore) for url in urls))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))

url2filename() is defined here.
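The answer relies on a url2filename() helper defined elsewhere. A plausible minimal version (an assumption, not necessarily the original definition) simply takes the basename of the URL's path component:

```python
from os.path import basename
from urllib.parse import urlsplit

def url2filename(url):
    # Assumed helper: extract the last path segment of the URL,
    # ignoring any query string or fragment.
    # 'http://host/dir/news-commentary-v10.en.gz' -> 'news-commentary-v10.en.gz'
    return basename(urlsplit(url).path)
```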
