I have a very large JSON file (1000+ MB) consisting of a single root-level list of identically structured JSON objects. For example:
[
  {
    "id": 1,
    "value": "hello",
    "another_value": "world",
    "value_obj": {
      "name": "obj1"
    },
    "value_list": [
      1,
      2,
      3
    ]
  },
  {
    "id": 2,
    "value": "foo",
    "another_value": "bar",
    "value_obj": {
      "name": "obj2"
    },
    "value_list": [
      4,
      5,
      6
    ]
  },
  {
    "id": 3,
    "value": "a",
    "another_value": "b",
    "value_obj": {
      "name": "obj3"
    },
    "value_list": [
      7,
      8,
      9
    ]
  },
  ...
]
Every single item in the root JSON list follows the same structure and thus would be individually deserializable. I already have the C# classes written to receive this data, and deserializing a JSON file containing a single object without the list works as expected.
At first, I tried to just directly deserialize my objects in a loop:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<MyObject>(reader);
    }
}
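This didn't work: it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call only reads a single object at the root level of the JSON file, and since the root here is a list of objects, the request is invalid.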
My next idea was to deserialize as a C# List of objects:
JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<List<MyObject>>(reader);
    }
}
This does succeed. However, it only somewhat reduces the issue of high RAM usage. In this case it does look like the application is deserializing items one at a time, and so is not reading the entire JSON file into RAM, but we still end up with a lot of RAM usage because the C# List object now contains all of the data from the JSON file in RAM. This has only displaced the problem.
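I then decided to simply try removing one character from the beginning of the stream (to get rid of the [ character) by calling sr.Read() before entering the loop. With that change the first object does read successfully, but the subsequent ones do not, failing with an "unexpected token" exception. My guess is that the comma and whitespace between the objects are throwing the reader off.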
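Simply stripping the square brackets won't work, because the objects contain primitive lists of their own, as you can see in the sample. Even using }, as a separator won't work, because, as you can also see, there are sub-objects inside the objects.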
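My goal is to be able to read the objects from the stream one at a time: read an object, process it, discard it from RAM, then read the next object, and so on. This would eliminate the need to load either the entire JSON string or the entire contents of the data as C# objects into RAM.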
What am I missing?
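To make the goal concrete, here is a minimal sketch of the read-process-discard loop described above. It assumes Newtonsoft.Json, and it relies on advancing the JsonTextReader token by token with Read() and calling Deserialize only when the reader sits on a StartObject token, so that each call consumes exactly one element of the root list. The names JsonArrayStreamer and ReadItems are hypothetical, and the MyObject shown is a cut-down stand-in for the real class, mapping just two of the sample fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical stand-in for the existing MyObject class (only two fields mapped).
public class MyObject
{
    [JsonProperty("id")] public int Id { get; set; }
    [JsonProperty("value")] public string Value { get; set; }
}

public static class JsonArrayStreamer
{
    // Hypothetical helper: lazily yields the elements of a root-level JSON array,
    // so only one deserialized item needs to be alive at a time.
    public static IEnumerable<T> ReadItems<T>(TextReader text)
    {
        var serializer = new JsonSerializer();
        // JsonTextReader closes the underlying TextReader when disposed (default CloseInput).
        using (var reader = new JsonTextReader(text))
        {
            // Read() steps over the opening '[', the commas and the whitespace.
            while (reader.Read())
            {
                // Nested objects and lists are consumed inside Deserialize, so every
                // StartObject seen in this loop is a top-level element of the array.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(reader);
                }
            }
        }
    }
}
```

With the StreamReader from the snippets above, this would be consumed as `foreach (MyObject o in JsonArrayStreamer.ReadItems<MyObject>(sr)) { /* process o */ }`. Because the iterator is lazy, each item becomes eligible for garbage collection once the loop body is done with it, provided nothing else keeps a reference to it.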