I routinely use tens of gigabytes of data in just this fashion.
For several suggestions on how to store your data, it's worth reading the docs and late in this thread.
Details which will affect how you store your data, like:
Give as much detail as you can, and I can help you develop a structure.
- Size of data, number of rows and columns, types of columns; are you appending rows, or just columns?
- What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, and save these.
- After that processing, what do you do? Is step 2 ad hoc, or repeatable?
- Input flat files: how many, and a rough total size in GB. How are these organized, e.g. by records? Does each file contain different fields, or does each file have some records with all of the fields present?
- Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
- Do you "work on" all of your columns (in groups), or is there a good proportion that you only use for reports (e.g. you want to keep the data around, but don't need to pull that column in explicitly until final results time)?
Solution
Ensure you have pandas at least 0.10.1 installed.
Read iterating files chunk-by-chunk and multiple table queries.
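As a quick illustration of the chunk-by-chunk idea on the read side, here's a minimal sketch (it assumes the store and the table key 'A' that are built further below; process() is a hypothetical per-chunk operation):

import pandas as pd

# a sketch only: 'mystore.h5' and the key 'A' refer to the store built below;
# passing chunksize makes select() return an iterator of DataFrames
with pd.HDFStore('mystore.h5') as store:
    for chunk in store.select('A', chunksize=50000):
        process(chunk)  # hypothetical per-chunk operation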
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this would work with one big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)
import numpy as np
import pandas as pd
# create a store
store = pd.HDFStore('mystore.h5')
# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
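As the comment above suggests, you may want a small class that owns this map and its inversion; a minimal sketch (no validation, and the field names are whatever you put into group_map):

class GroupMap:
    """Wraps the field-group mapping and its inversion (a sketch only)."""

    def __init__(self, group_map):
        self.group_map = group_map
        # invert: field name -> the group it lives in
        self.inverted = {f: g for g, v in group_map.items() for f in v['fields']}

    def fields(self, group):
        return self.group_map[group]['fields']

    def data_columns(self, group):
        return self.group_map[group]['dc']

    def group_of(self, field):
        return self.inverted[field]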
Read in the files and create the storage (essentially doing what append_to_multiple does):
for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):

        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():

            # create the frame for this group
            frame = chunk.reindex(columns=v['fields'], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v['dc'])
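Alternatively, pandas' own append_to_multiple can do this splitting for you: you give it a dict mapping each table key to its column list (one value may be None, meaning "all remaining columns"), plus a selector table that carries the data_columns you query on. A sketch with placeholder names:

# inside the chunk loop above, instead of the manual group loop;
# 'A', 'B', and the field names are placeholders
d = {
    'A': ['field_1', 'field_2'],   # columns to route to table A
    'B': None,                     # None = all remaining columns
}
store.append_to_multiple(d, chunk, selector='A', data_columns=['field_1'])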
Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).
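If you want to sanity-check what ended up in the file, a minimal sketch:

# keys mirror the group_map groups
print(store.keys())                  # e.g. ['/A', '/B', '/REPORTING_ONLY']

# row count of one table, without reading it into memory
print(store.get_storer('A').nrows)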
This is how you get columns and create new ones:
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows
# do calculations on this frame
new_frame = cool_function_on_frame(frame)
# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables);
# add this info to the group_map
store.append(new_group,
             new_frame.reindex(columns=new_columns_created, copy=False),
             data_columns=new_columns_created)
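To make the options in the comments above concrete, here's a sketch of pulling only a few columns of one group, restricted to a subset of rows (the group and field names are placeholders):

frame = store.select(
    'A',
    columns=['field_1', 'field_2', 'field_5'],  # only these columns of group A
    where='field_1 > 0',                        # the where column must be one of
)                                               # the table's data_columns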
When you are ready to do post-processing:
# This may be a bit tricky, and depends on what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple(
    [group_1, group_2, .....],
    where=['field_1>0', 'field_1000=foo'],
    selector=group_1,
)
About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on that column. E.g. something like:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report-generation stage (essentially a data column is segregated from the other columns, which might impact efficiency somewhat if you define a lot of them).
You also might want to:
- create a function which takes a list of fields, looks up the groups in the group_map, then selects those groups and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
- add indexes on certain data columns (makes row-subsetting much faster).
- enable compression (a sketch of both of these follows below).
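A minimal sketch of both suggestions (we appended with index=False above, so indexes are created at the end; the compression settings shown are one common choice, not the only one):

# build a full index on the data columns you query most often
store.create_table_index('A', columns=['field_1'], optlevel=9, kind='full')

# compression is set when the store is created (or afterwards with the
# PyTables ptrepack command-line tool)
store = pd.HDFStore('mystore.h5', complib='blosc', complevel=9)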
Let me know if you have questions!