Python: reading multiple CSVs from a zip file with Dask

Posted on August 16

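Following this answer, I have been trying to use Dask to read multiple CSVs from a zipped directory. However, I get a very long error message that I cannot make sense of. I think the most important line is: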

msgpack.exceptions.ExtraData: unpack(b) received extra data.

The data is publicly available.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# read data, the dask way
df = dd.read_csv('zip://BACI*.csv', sep=",", dtype={"k":str, "i":int, "j":int, "t":int}, storage_options={'fo': '../input/baci_hs92.zip'})
df.head()

I believe this kind of on-the-fly extraction should work in Dask, and I would rather not unzip all the files into a directory as other answers suggest.

Recommended answer

The following is an equivalent approach that works:

In [1]: u = "http://www.cepii.fr/DATA_DOWNLOAD/baci/data/BACI_HS92_V202301.zip"

In [2]: import numpy as np
   ...: import pandas as pd
   ...: import dask.dataframe as dd
   ...:
   ...: # read data, the dask way
   ...: df = dd.read_csv(f'zip://BACI*.csv::{u}', sep=",", dtype={"k":str, "i":int, "j":int, "t":int})
   ...: df.head()

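However, this is slow, because Dask reads blocks of the compressed data up front in order to find newline offsets. It has to scan from the start of each member file, because seeking within a DEFLATE stream is hard.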
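To illustrate why that scan is sequential, here is a minimal sketch using only the standard library. It is not Dask's implementation: the helper names `newline_offsets` and `build_demo_zip`, and the tiny in-memory archive, are made up for illustration. The point is only that, without a pre-built index, finding newline positions in a deflated member means decompressing it from its first byte.

```python
import io
import zipfile


def newline_offsets(zip_file, member, chunk_size=65536):
    """Return the uncompressed byte offset of every newline in one zip member.

    The member is decompressed sequentially from its first byte; nothing here
    jumps ahead in the compressed stream.
    """
    offsets = []
    position = 0  # uncompressed bytes consumed so far
    with zipfile.ZipFile(zip_file) as zf, zf.open(member) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                idx = chunk.find(b"\n", start)
                if idx == -1:
                    break
                offsets.append(position + idx)
                start = idx + 1
            position += len(chunk)
    return offsets


def build_demo_zip():
    """Build a small deflated archive in memory (hypothetical sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("BACI_demo.csv", "t,i,j\n2000,4,8\n2001,4,12\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(newline_offsets(build_demo_zip(), "BACI_demo.csv"))  # [5, 14, 24]
```

For a real archive the cost of this loop grows with the uncompressed size of every member, which is consistent with the slow start described above.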

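If you instead add blocksize=None, the upfront cost is much smaller, since there is no need to find newlines; however, even getting .head() then requires reading the entire contents of the first compressed file. Also, it reports a dtype mismatch for column "q", probably because the first rows used for dtype guessing all contain numbers, while later rows in the same column contain object-type content.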

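The kerchunk project is interested in finding and indexing newlines in CSVs (https://github.com/fsspec/kerchunk/issues/66) and in indexing ZIP/GZIP files (https://github.com/fsspec/kerchunk/issues/281). That would mean that once someone has done the upfront indexing work, data like this could be accessed quickly and in parallel. This functionality does not exist yet.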