JSON Python: Convert and filter structured text into objects

Posted on July 3

Current problem: I am working with a set of data files that basically look like this:

{39107,
    {31685,
        {   f24c4ec6-1e59-47a0-9736-8c823eda0d28,
            "N",
            7
        },
        {   c71dce36-4295-49e4-be03-7c60969b96c3,
            "A",
            8
        },
        {   f80fce14-f001-4b20-84d5-7a00f0788f6b,
            "A",
            9
        },
    }
}

and

{0,
    {4659,
        {
                        7c90ea6a-12f5-4c54-bfe0-e38120a6e364,
                        "fieldname27472",
                        "N",
                        27472,
                        "",
                        {3,
                                {"field1",
                                        0,
                                        {1,
                                                {
                                                        "B",
                                                        16,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                                {"field2",
                                        0,
                                        {1,
                                                {
                                                "T",
                                                0,
                                                0,
                                                "",
                                                0}
                                        },
                                        "",
                                        0
                                },
                                {"field3",
                                        0,
                                        {1,
                                                {
                                                        "L",
                                                        0,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                        },
                        {0},
                        {1,
                                {
                                        edcba,
                                        "ByID",
                                        abcde,
                                        1,
                                        {1,
                                                "ID"
                                        },
                                        1,
                                        0,
                                        0
                                }
                        },
                        1,
                        "S",
                        {0},
                        {0},
                        "",
                        0,
                        0
                }
        }
}

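The number before a dataset, e.g. 4659, is the number of data containers that follow. Some values are not enclosed in quotes, such as the UUIDs or random strings in this example.

My goal is to convert these data structures into Python objects such as lists or tuples, and then convert them to JSON for external processing.

Right now I have a two-stage process. Stage 1 performs the initial conversion and data evaluation. Stage 2 filters the data, removing excess values (such as the element count that precedes the actual elements) and nested lists.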

import json

file = 'stack1.json'

def stage1(msg):
    buffer = ''
    st,fh,delim,encase = '[',']',',', '"'
    msg = msg.translate(str.maketrans('{}',st+fh)).replace('\n', '').replace('\r', '').replace('\t', '')
    while True:
        fhpos = msg.find(fh)
        if fhpos >= 0:
            head = msg[:fhpos+1]
            if head:
                stpos = head.rfind(st)
                if stpos>=0:
                    teststring = head[stpos+1:fhpos].split(delim)
                    for idx,sent in enumerate(teststring):
                        if not (sent.startswith(encase) or sent.endswith(encase)) or sent.count('-') == 4:
                            teststring[idx] = (f'"{teststring[idx]}"')
                            break
                    buffer+= head[:stpos+1]+','.join(teststring)+fh
                else: buffer+=fh
            msg = msg[fhpos+1:]
        else:
            break
    return buffer

def stage2(lst):
    if not any([isinstance(i,list) for i in lst]):
        return tuple(lst)
    if not isinstance(lst[0],list) and all([isinstance(j,list) for j in lst[1:]]):
        lst = stage2(lst[1:])
        if all([isinstance(j,(list,tuple)) for j in lst]) and len(lst) == 1:
            lst, = lst
    for idx,i in enumerate(lst):
        if isinstance(i,list):
            lst[idx] = stage2(i)
        else:
            continue
    return stage2(lst)

with open(file, 'r') as f:
    data = f.read()
    try:
        s1 = stage1(data)
        print("STAGE1\n",s1)
        s2 = stage2(json.loads(s1))
        print("STAGE2\n",json.dumps(s2, indent=2))
    except Exception as e: print(e)

Current results:

Example 1:

STAGE1
[39107,[31685,["f24c4ec6-1e59-47a0-9736-8c823eda0d28","N",7],["c71dce36-4295-49e4-be03-7c60969b96c3","A",8],["f80fce14-f001-4b20-84d5-7a00f0788f6b","A",9]]]
STAGE2
 [
  [
    "f24c4ec6-1e59-47a0-9736-8c823eda0d28",
    "N",
    7
  ],
  [
    "c71dce36-4295-49e4-be03-7c60969b96c3",
    "A",
    8
  ],
  [
    "f80fce14-f001-4b20-84d5-7a00f0788f6b",
    "A",
    9
  ]
]

Example 2:

STAGE1
[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N",27472,"",[3,[aa-aa-a-a-a,"field1",0,[1,["B","16",0,"",0]]],["field2",0,[1,["T","0",0,"",0]]],["field3",0,[1,["L","0",0,"",0]]]],["0"],[1,[edcba,"ByID",abcde,1,["1","ID"]]],1,"S",["0"],["0"]]]]
STAGE2
Expecting ',' delimiter: line 1 column 12 (char 11)

Example 2 fails because not all values are quoted.
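The failure can be reproduced in isolation. The sketch below feeds a shortened version of the Stage 1 output for example 2 to `json.loads`; the string `broken` is a hypothetical excerpt made up for illustration, not the full data. The decoder reads the leading `7` of the UUID as a number and then stops at the `c` that follows, which is why the error points at column 12.

```python
import json

# Hypothetical shortened version of the Stage 1 output for example 2:
# the UUID is still unquoted.
broken = '[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N"]]]'

try:
    json.loads(broken)
    error = None
except json.JSONDecodeError as exc:
    error = exc

# The decoder accepts "7" as a number, then trips over the next character.
print(error)               # Expecting ',' delimiter: line 1 column 12 (char 11)
print(broken[error.pos])   # c
```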

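What library might suit this case? The datasets are fairly large: the first example is currently ~5M characters, and Stage 1 takes up to 1 minute to process it.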

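What is the best way to convert and filter data like this? I think converting and filtering in a single pass would be faster than running several full scans. I read about PLY and PEG, but I don't think they are the right tools for this job.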

Recommended answer

import re
import json

def process(s):
    # replace braces with square brackets
    s = re.sub(r"^(\s*){\n?", r"\1[\n", s, flags=re.M)
    s = re.sub(r"}(,?)$", r"]\1", s, flags=re.M)
    # remove trailing commas (not valid in JSON)
    s = re.sub(r",$(\s+])", r"\1", s, flags=re.M)
    # wrap hex in quotes
    s = re.sub(r'^(\s*)(?=.*[\-a-z])([\w\-]+)', r'\1"\2"', s, flags=re.M)
    return json.loads(s)

with open("stack.json", 'r') as f:
    data = process(f.read())
    print(data)
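The single-pass idea raised in the question could also be sketched without going through `json.loads` at all: tokenize the text once with `re.finditer` and build nested lists on a stack, filtering each container as soon as its closing brace is seen. This is only a sketch under assumptions: the names `TOKEN`, `INTEGER`, `simplify` and `parse` are made up for illustration, quoted strings are assumed to contain no embedded double quotes, braces are assumed to be balanced, and the "leading integer followed only by containers is a count" rule is a heuristic modelled on the question's Stage 2, not a known property of the format.

```python
import json
import re

# One alternative per token kind. Commas and whitespace match nothing,
# so finditer simply skips over them.
TOKEN = re.compile(r'''
    (?P<open>\{)          # start of a container
  | (?P<close>\})         # end of a container
  | (?P<string>"[^"]*")   # quoted string (assumed: no embedded quotes)
  | (?P<bare>[^\s{},"]+)  # unquoted value: number, UUID, or bare word
''', re.VERBOSE)

INTEGER = re.compile(r'-?\d+')


def simplify(items):
    """Drop a leading count and unwrap a single remaining container.

    Heuristic (an assumption): an integer followed only by nested
    lists is treated as the element count and removed.
    """
    if (len(items) > 1 and isinstance(items[0], int)
            and all(isinstance(x, list) for x in items[1:])):
        items = items[1:]
        if len(items) == 1:
            items = items[0]
    return items


def parse(text):
    """Parse brace-delimited text into nested lists in one pass."""
    root = []
    stack = [root]
    for match in TOKEN.finditer(text):
        kind, token = match.lastgroup, match.group()
        if kind == "open":
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif kind == "close":
            finished = stack.pop()
            # The finished container is the last item of its parent.
            stack[-1][-1] = simplify(finished)
        elif kind == "string":
            stack[-1].append(token[1:-1])
        elif INTEGER.fullmatch(token):
            stack[-1].append(int(token))
        else:
            stack[-1].append(token)  # UUIDs and bare words stay strings
    return root[0]


sample = '''{39107,
    {31685,
        {   f24c4ec6-1e59-47a0-9736-8c823eda0d28,
            "N",
            7
        },
        {   c71dce36-4295-49e4-be03-7c60969b96c3,
            "A",
            8
        },
    }
}'''
print(json.dumps(parse(sample), indent=2))
```

On this shortened version of example 1 the sketch yields the same shape as the Stage 2 output shown in the question: a list of `[uuid, "N", 7]`-style records with both counts removed. Because the text is scanned once and never re-sliced, it avoids the repeated string copying in the question's Stage 1, although actual timings on a ~5M-character file would need to be measured.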
