I'm reading a ROOT file using uproot and converting parts of it into a DataFrame with the arrays method.
This works fine until I try to save to parquet with the DataFrame's to_parquet method. Sample code is given below.
import numpy as np
import pandas as pd
import uproot

# First three lines rename the columns and choose which data to keep
data = pd.read_csv(dictFile, header=None, delim_whitespace=True)
dataFile, dataKey = data[0], data[1]
content_ele = {dataKey[i]: dataFile[i] for i in np.arange(len(dataKey))}

# Run over the different files and save a simplified version of each.
file_list = pd.read_csv(file_list_loc, names=["Loc"])
for file_loc in file_list.Loc:
    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")
    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)
    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)
Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps.
Notably, the simplest DataFrame that triggers this error for me has two columns where every row holds a different-length awkward array (awkward.highlevel.Array), but within each row both columns have the same length. An example is given below.
A B
0 [31, 26, 17, 23] [-2.1, 1.3, 0.5, -0.4]
1 [75, 15, 49] [2.4, -1.8, 0.8]
2 [58, 45, 64, 47] [-1.9, -0.4, -2.5, 1.3]
3 [26] [-1.1]
I've tried reducing what elements I run on, e.g. keeping only integer branches, and reducing the number of columns as above.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously awkward arrays are now plain Python lists, which makes it much more impractical to compute something like array.A/2. Yes, I could just convert them back, but it seems wiser to keep the original format, and it is easier since I wouldn't have to repeat the conversion every time.