I'm trying to convert the following deeply nested JSON into (eventually) a CSV. Below is just a sample; the full JSON is very large (12GB).

{'reporting_entity_name':'Blue Cross and Blue Shield of Alabama', 
'reporting_entity_type':'health insurance issuer', 
'last_updated_on':'2022-06-10',
'version':'1.1.0', 
'in_network':[
            {'negotiation_arrangement': 'ffs', 
            'name': 'xploration of Kidney', 
            'billing_code_type': 'CPT', 
            'billing_code_type_version': '2022', 
            'billing_code': '50010', 
            'description': 'Renal Exploration, Not Necessitating Other Specific Procedures', 
            'negotiated_rates': [{
                'negotiated_prices': [
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 993.0, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1180.68, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1283.95, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1042.65, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1290.9, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}, 
                    {'negotiated_type': 'negotiated', 
                    'negotiated_rate': 1241.25, 
                    'expiration_date': '2022-06-30', 
                    'service_code': ['21', '22', '24'], 
                    'billing_class': 'professional'}
                    }]}]}]}

The end goal is to have a dataframe or a dictionary that I can then write to a CSV. I'd like each row to have these columns:

{'reporting_entity_name':'','reporting_entity_type':'','last_updated_on':'','version':'','negotiation_arrangement':'','name':'','billing_code_type':'','billing_code_type_version':'','billing_code':'','description':'','provider_groups':'','negotiated_type':'','negotiated_rate':'','expiration_date':'','service_code':'','billing_class':''}

So far I've tried pandas json_normalize, flatten, and some custom modules I found on GitHub, but none of them seem to normalize/flatten the data into new rows, only into new columns. Because the dataset is so large, I'm wary of doing this recursively with a bunch of nested loops, since I'm afraid it would quickly exhaust all of my memory. Thanks in advance for any advice!
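For reference, this is roughly the kind of call I've been trying on the small sample above (the record_path/meta choices below are my own attempt and may well be part of what I'm getting wrong); in any case the whole file still has to fit in memory at once:

import pandas as pd

# "sample" is assumed to hold the dict shown above
rows = pd.json_normalize(
    sample,
    # one output row per innermost negotiated_prices entry
    record_path=["in_network", "negotiated_rates", "negotiated_prices"],
    # outer-level fields to repeat onto every row
    meta=[
        "reporting_entity_name",
        "reporting_entity_type",
        "last_updated_on",
        "version",
        ["in_network", "negotiation_arrangement"],
        ["in_network", "name"],
        ["in_network", "billing_code"],
    ],
)
rows.to_csv("sample_out.csv", index=False)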

Recommended Answer

First of all, a confession: I'm the author of the library used below, convtools (GitHub).

Assuming your data is a list of samples like the one you provided:

input_data = [
    {
        "reporting_entity_name": "blue cross and blue shield of alabama",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-06-10",
        "version": "1.1.0",
        "in_network": [
            {
                "negotiation_arrangement": "ffs",
                "name": "xploration of kidney",
                "billing_code_type": "cpt",
                "billing_code_type_version": "2022",
                "billing_code": "50010",
                "description": "renal exploration, not necessitating other specific procedures",
                "negotiated_rates": [
                    {
                        "negotiated_prices": [
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 993.0,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                            {
                                "negotiated_type": "negotiated",
                                "negotiated_rate": 1180.68,
                                "expiration_date": "2022-06-30",
                                "service_code": ["21", "22", "24"],
                                "billing_class": "professional",
                            },
                        ]
                    }
                ],
            }
        ],
    }
]

I'm defining a config that you can adjust to add/remove the fields to be fetched at each level. The function then recursively builds a conversion, which is compiled into an ad hoc, recursion-free Python function.

from convtools.contrib.tables import Table
from convtools import conversion as c

flattening_levels = [
    {
        "fields": ["reporting_entity_name", "reporting_entity_type"],
        "list_field": "in_network",
    },
    {
        "fields": ["negotiation_arrangement", "name"],
        "list_field": "negotiated_rates",
    },
    {"list_field": "negotiated_prices"},
    {
        "fields": ["negotiated_type", "negotiated_rate"],
    },
]

def build_conversion(levels, index=0):
    level_config = levels[index]

    # basic step is to iterate an input
    return c.iter(
        c.zip(
            # zip nested list
            (
                c.item(level_config["list_field"]).pipe(
                    build_conversion(levels, index + 1)
                )
                if "list_field" in level_config
                else c.naive([{}])
            ),
            # with repeated fields of the current level
            c.repeat(
                {
                    field: c.item(field)
                    for field in level_config.get("fields", ())
                }
            ),
        )
        # update inner objects with fields from the current level
        .iter_mut(c.Mut.custom(c.item(0).call_method("update", c.item(1))))
        # take inner objects, forgetting about the top level
        .iter(c.item(0))
    ).flatten()

flatten_objects = build_conversion(flattening_levels).gen_converter()

Table.from_rows(flatten_objects(input_data)).into_csv("out.csv")

out.csv looks like this:

negotiated_type,negotiated_rate,negotiation_arrangement,name,reporting_entity_name,reporting_entity_type
negotiated,993.0,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer
negotiated,1180.68,ffs,xploration of kidney,blue cross and blue shield of alabama,health insurance issuer

Let's profile it:

import tqdm

many_objects = input_data * 1000000

"""
In [151]: %time Table.from_rows(tqdm.tqdm(flatten_objects(many_objects))).into_csv("out.csv")

2000000it [00:07, 274094.52it/s]
CPU times: user 7.13 s, sys: 141 ms, total: 7.27 s
Wall time: 7.3 s
"""

# resulting file is ~210MB

The solution can be optimized so that it doesn't update dicts multiple times. Let me know if you need that.
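If you need all of the columns listed in the question, the config just needs more fields at each level; here is a sketch based on the field names in the sample (provider_groups doesn't appear in the sample, so it's omitted, and service_code is a list, so it will be written as its Python repr unless you convert it first):

flattening_levels = [
    {
        "fields": [
            "reporting_entity_name",
            "reporting_entity_type",
            "last_updated_on",
            "version",
        ],
        "list_field": "in_network",
    },
    {
        "fields": [
            "negotiation_arrangement",
            "name",
            "billing_code_type",
            "billing_code_type_version",
            "billing_code",
            "description",
        ],
        "list_field": "negotiated_rates",
    },
    # negotiated_rates items in the sample only contain negotiated_prices
    {"list_field": "negotiated_prices"},
    {
        "fields": [
            "negotiated_type",
            "negotiated_rate",
            "expiration_date",
            "service_code",
            "billing_class",
        ],
    },
]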
