I'm reading a ROOT file using uproot and converting parts of it into a DataFrame using the arrays method.
This works fine until I try to save the result to Parquet using the DataFrame's to_parquet method. Sample code is given below.

import pandas as pd
import uproot

# The first three statements build the column-renaming map and choose what data to keep
data = pd.read_csv(dictFile, header=None, sep=r"\s+")
dataFile, dataKey = data[0], data[1]
content_ele = dict(zip(dataKey, dataFile))

# We run over the different files to save a simplified version of them.
file_list = pd.read_csv(file_list_loc, names=["Loc"])

for file_loc in file_list.Loc:

    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")

    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)

    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)

Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps.

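It is worth noting that the simplest DataFrame I've gotten this error with has 2 columns, where each row holds awkward arrays (awkward.highlevel.Array) whose length differs from row to row but is the same across the columns. An example is given below.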

           A                      B
0   [31, 26, 17, 23]    [-2.1, 1.3, 0.5, -0.4]
1   [75, 15, 49]        [2.4, -1.8, 0.8] 
2   [58, 45, 64, 47]    [-1.9, -0.4, -2.5, 1.3]
3   [26]                [-1.1] 

I've tried reducing which elements I run on (for example, only integers) and reducing the number of columns, as above.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously awkward arrays are now just lists, which makes them much less practical to work with whenever I want to calculate something like array.A/2. Yes, I could just convert them, but it seems wiser to keep the original format, and it is easier since I don't have to convert each time.

Recommended answer

Solution: upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error; then I upgraded to 2023.8.0 and it went away.
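
Before upgrading, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration, and the list of packages is an assumption based on the libraries mentioned in this thread.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI. pandas and pyarrow are listed only
# because they appear in the call path; awkward-pandas is the one that matters.
for name in ("awkward-pandas", "awkward", "uproot", "pandas", "pyarrow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If awkward-pandas reports an old version such as 2022.12a1, something like pip install --upgrade awkward-pandas (or the equivalent in your package manager) should bring it up to date.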

I'm writing all of this down because I'm proud of myself. :)

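I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. Your code sample doesn't indicate this, but if the data were purely numerical (no variable-length lists or nested data structures), then arrays would be a normal Pandas DataFrame. In that case, if you got an error, it would be a Pandas error, which is possible but less likely, since other people would have noticed it first.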

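So, assuming that arrays is a DataFrame of ragged data (and this is uproot >= 5.0), the data type in each column is managed by awkward-pandas. If that's the case, I should be able to reproduce the error like this: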

>>> import awkward as ak
>>> import pandas as pd
>>> import awkward_pandas
>>> ragged_array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> ak_ext_array = awkward_pandas.AwkwardExtensionArray(ragged_array)
>>> df = pd.DataFrame({"column": ak_ext_array})
>>> df
         column
0     [0, 1, 2]
1            []
2        [3, 4]
3           [5]
4  [6, 7, 8, 9]
>>> df.to_parquet("/tmp/file.parquet")

And I do (using awkward-pandas version 2022.12a1):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3480, in pyarrow.lib.Table.from_pandas
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 590, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 263, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
TypeError: __arrow_array__() got an unexpected keyword argument 'type'

(For the future: including the whole stack trace would eliminate a lot of guesswork.)
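
When the failing call sits inside a loop over many files, one way to capture the whole stack trace for a bug report is the standard library's traceback module. This is only a sketch: run_and_report and failing_step are hypothetical names, and failing_step merely stands in for a call such as arrays.to_parquet(path=save_loc).

```python
import traceback


def run_and_report(func, *args, **kwargs):
    """Call func; on failure, return the full formatted traceback as a string."""
    try:
        func(*args, **kwargs)
    except Exception:
        return traceback.format_exc()
    return None


def failing_step():
    # Hypothetical stand-in for a call such as arrays.to_parquet(path=save_loc).
    raise TypeError("__arrow_array__() got an unexpected keyword argument 'type'")


report = run_and_report(failing_step)
print(report)
```

The printed text can be pasted into a question as-is, so readers can see which library raised the error and through which call path.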

My first thought was, "Maybe awkward-pandas doesn't implement the __arrow_array__ protocol yet." But that's not the case: AwkwardExtensionArray does have an __arrow_array__ method:

>>> ak_ext_array.__arrow_array__()
<pyarrow.lib.ChunkedArray object at 0x7ff422d0b9f0>
[
  [
    [
      0,
      1,
      2
    ],
    [],
    ...
    [
      5
    ],
    [
      6,
      7,
      8,
      9
    ]
  ]
]

Then I thought, "Maybe it has an __arrow_array__ method, but that method doesn't accept a type argument," which is what the error message says.

>>> help(ak_ext_array.__arrow_array__)
Help on method __arrow_array__ in module awkward_pandas.array:
__arrow_array__() method of awkward_pandas.array.AwkwardExtensionArray instance

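Aha! That's it! So I was about to write an issue on awkward-pandas and, in doing so, point to the function definition that's missing the type argument. But the function definition is not missing the type argument.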

https://github.com/intake/awkward-pandas/blob/1f8cf19fdc9cb0786642f39cfaf7c084c3c5c9bc/src/awkward_pandas/array.py#L148-L151

It's just that my copy of the package was too old. This was an old bug that has since been fixed.
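
The check done with help() above can also be scripted with the standard library's inspect module. This is a sketch under stated assumptions: OldStyleArray and NewStyleArray are hypothetical stand-ins for the outdated and fixed method signatures, not the real awkward_pandas classes, and accepts_type_keyword is a made-up helper name. On a real installation you would pass ak_ext_array to the helper instead.

```python
import inspect


class OldStyleArray:
    """Hypothetical stand-in for an extension array with the outdated method."""

    def __arrow_array__(self):
        return "arrow data"


class NewStyleArray:
    """Hypothetical stand-in for an extension array with the fixed method."""

    def __arrow_array__(self, type=None):
        return "arrow data"


def accepts_type_keyword(obj):
    """Report whether obj.__arrow_array__ can be called with a `type` keyword."""
    params = inspect.signature(obj.__arrow_array__).parameters
    has_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return "type" in params or has_var_kwargs


for obj in (OldStyleArray(), NewStyleArray()):
    print(type(obj).__name__, accepts_type_keyword(obj))
    try:
        # The error message implies the method is invoked with a `type` keyword.
        obj.__arrow_array__(type=None)
    except TypeError as err:
        print("  TypeError:", err)
```

The stand-in without a type parameter fails with the same kind of TypeError as in the traceback, while the one that accepts type=None does not.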

I upgraded my awkward-pandas, and now everything works:

>>> df.to_parquet("/tmp/file.parquet")

(no errors)

>>> ak.from_parquet("/tmp/file.parquet").show()
[{column: [0, 1, 2]},
 {column: []},
 {column: [3, 4]},
 {column: [5]},
 {column: [6, 7, 8, 9]}]

(reads back appropriately)
