Current problem. I am working with a set of data files that look essentially like this:
{39107,
{31685,
{ f24c4ec6-1e59-47a0-9736-8c823eda0d28,
"N",
7
},
{ c71dce36-4295-49e4-be03-7c60969b96c3,
"A",
8
},
{ f80fce14-f001-4b20-84d5-7a00f0788f6b,
"A",
9
},
}
}
and
{0,
{4659,
{
7c90ea6a-12f5-4c54-bfe0-e38120a6e364,
"fieldname27472",
"N",
27472,
"",
{3,
{"field1",
0,
{1,
{
"B",
16,
0,
"",
0
}
},
"",
0
},
{"field2",
0,
{1,
{
"T",
0,
0,
"",
0}
},
"",
0
},
{"field3",
0,
{1,
{
"L",
0,
0,
"",
0
}
},
"",
0
},
},
{0},
{1,
{
edcba,
"ByID",
abcde,
1,
{1,
"ID"
},
1,
0,
0
}
},
1,
"S",
{0},
{0},
"",
0,
0
}
}
}
The number that precedes a dataset, e.g. 4659, indicates how many data containers follow. Some values are not wrapped in quotes, such as the UUIDs or random strings in this example.
My goal is to convert these data structures into Python objects such as lists or tuples, and then convert them to JSON for external processing.
Right now I have a two-stage process. Stage 1 performs the initial conversion and decides which values need quoting. Stage 2 filters the data, removing the redundant values (such as the element counts that precede the actual elements) and collapsing nested lists.
import json

file = 'stack1.json'

def stage1(msg):
    buffer = ''
    st, fh, delim, encase = '[', ']', ',', '"'
    msg = msg.translate(str.maketrans('{}', st + fh)).replace('\n', '').replace('\r', '').replace('\t', '')
    while True:
        fhpos = msg.find(fh)
        if fhpos >= 0:
            head = msg[:fhpos + 1]
            if head:
                stpos = head.rfind(st)
                if stpos >= 0:
                    teststring = head[stpos + 1:fhpos].split(delim)
                    for idx, sent in enumerate(teststring):
                        if not (sent.startswith(encase) or sent.endswith(encase)) or sent.count('-') == 4:
                            teststring[idx] = f'"{teststring[idx]}"'
                            break
                    buffer += head[:stpos + 1] + ','.join(teststring) + fh
                else:
                    buffer += fh
            msg = msg[fhpos + 1:]
        else:
            break
    return buffer

def stage2(lst):
    if not any(isinstance(i, list) for i in lst):
        return tuple(lst)
    if not isinstance(lst[0], list) and all(isinstance(j, list) for j in lst[1:]):
        lst = stage2(lst[1:])
    if all(isinstance(j, (list, tuple)) for j in lst) and len(lst) == 1:
        lst, = lst
    for idx, i in enumerate(lst):
        if isinstance(i, list):
            lst[idx] = stage2(i)
    return stage2(lst)

with open(file, 'r') as f:
    data = f.read()
try:
    s1 = stage1(data)
    print("STAGE1\n", s1)
    s2 = stage2(json.loads(s1))
    print("STAGE2\n", json.dumps(s2, indent=2))
except Exception as e:
    print(e)
Current results:
Example 1:
STAGE1
[39107,[31685,["f24c4ec6-1e59-47a0-9736-8c823eda0d28","N",7],["c71dce36-4295-49e4-be03-7c60969b96c3","A",8],["f80fce14-f001-4b20-84d5-7a00f0788f6b","A",9]]]
STAGE2
[
[
"f24c4ec6-1e59-47a0-9736-8c823eda0d28",
"N",
7
],
[
"c71dce36-4295-49e4-be03-7c60969b96c3",
"A",
8
],
[
"f80fce14-f001-4b20-84d5-7a00f0788f6b",
"A",
9
]
]
Example 2:
STAGE1
[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N",27472,"",[3,[aa-aa-a-a-a,"field1",0,[1,["B","16",0,"",0]]],["field2",0,[1,["T","0",0,"",0]]],["field3",0,[1,["L","0",0,"",0]]]],["0"],[1,[edcba,"ByID",abcde,1,["1","ID"]]],1,"S",["0"],["0"]]]]
STAGE2
Expecting ',' delimiter: line 1 column 12 (char 11)
Example 2 fails because not all values end up quoted.
What library might be suitable here? The datasets are fairly large: the first example is currently ~5M characters, and Stage 1 takes up to a minute to process it.
What is the best way to convert and filter data like this? I assume that converting and filtering in a single pass would be faster than doing multiple full scans. I have read a bit about PLY and PEG, but I don't think they are the right tools for this job.
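For a single pass, one option I am considering (a sketch, assuming the grammar really is only nested braces, double-quoted strings without escapes, integers, and bare tokens, and mirroring Stage 2's rule of dropping a leading count when everything after it is a container) is a small hand-rolled tokenizer plus recursive descent that builds the Python structure directly, with no intermediate JSON string:

```python
import re

# One token per match: brace, comma, quoted string, or bare token.
# Assumes strings contain no escaped quotes.
TOKEN = re.compile(r'\s*(\{|\}|,|"[^"]*"|[^\s{},"]+)')

def parse(text):
    pos = 0

    def next_token():
        nonlocal pos
        m = TOKEN.match(text, pos)
        if not m:
            return None
        pos = m.end()
        return m.group(1)

    def value(tok):
        if tok == '{':
            return container()
        if tok.startswith('"'):
            return tok[1:-1]
        try:
            return int(tok)
        except ValueError:
            return tok  # bare token: UUID or identifier

    def container():
        items = []
        while True:
            tok = next_token()
            if tok == '}' or tok is None:
                break
            if tok == ',':
                continue
            items.append(value(tok))
        # Mirror Stage 2: drop a leading non-container value when every
        # remaining element is a container (it is an element count),
        # then unwrap single-element containers.
        if len(items) > 1 and not isinstance(items[0], list) and all(isinstance(i, list) for i in items[1:]):
            items = items[1:]
        if len(items) == 1 and isinstance(items[0], list):
            items = items[0]
        return items

    return value(next_token())
```

The result is already plain lists, so `json.dumps(parse(data))` would produce the final output in one scan. This also sidesteps the repeated `msg = msg[fhpos+1:]` slicing in `stage1`, which copies the remaining string on every iteration and is likely the main reason 5M characters take a minute.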