I'm trying to do some stateful JSON parsing with serde and serde_json. I started by checking out "How to pass options to Rust's serde that can be accessed in Deserialize::deserialize()?", and while I've nearly got what I need, I seem to be missing something crucial.
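What I'm trying to do is two things:
- My JSON is very large, too large to read the whole input into memory, so I need to stream it. (FWIW, it also has many nested layers, so I need to use disable_recursion_limit.)
- I need some stateful processing, where I can pass some data to the deserializer that will affect which data is kept from the input JSON and how that data is transformed during deserialization.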
For example, my input might look like this:
{ "documents": [
    { "foo": 1 },
    { "baz": true },
    { "bar": null }
  ],
  "journal": { "timestamp": "2023-04-04T08:28:00" }
}
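Here, each object in the "documents" array is very large, and I only need a subset of each one. Unfortunately, I first need to find the key-value pair "documents", and then I need to visit each element in that array. For now, I don't care about the other key-value pairs (e.g. "journal"), but that may change.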
My current approach is as follows:
use serde::de::DeserializeSeed;
use serde_json::Value;

/// A simplified state passed to and returned from the serialization.
#[derive(Debug, Default)]
struct Stats {
    records_skipped: usize,
}

/// Models the input data; `Documents` is just a vector of JSON values,
/// but it is its own type to allow custom deserialization
#[derive(Debug)]
struct MyData {
    documents: Vec<Value>,
    journal: Value,
}

struct MyDataDeserializer<'a> {
    state: &'a mut Stats,
}

/// Top-level seeded deserializer only so I can plumb the state through
impl<'de> DeserializeSeed<'de> for MyDataDeserializer<'_> {
    type Value = MyData;

    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = MyDataVisitor(&mut self.state);
        let docs = deserializer.deserialize_map(visitor)?;
        Ok(docs)
    }
}

struct MyDataVisitor<'a>(&'a mut Stats);

impl<'de> serde::de::Visitor<'de> for MyDataVisitor<'_> {
    type Value = MyData;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a map")
    }

    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::MapAccess<'de>,
    {
        let mut documents = Vec::new();
        let mut journal = Value::Null;
        while let Some(key) = map.next_key::<String>()? {
            println!("Got key = {key}");
            match &key[..] {
                "documents" => {
                    // Not sure how to handle the next value in a streaming manner
                    documents = map.next_value()?;
                }
                "journal" => journal = map.next_value()?,
                _ => panic!("Unexpected key '{key}'"),
            }
        }
        Ok(MyData { documents, journal })
    }
}

struct DocumentDeserializer<'a> {
    state: &'a mut Stats,
}

impl<'de> DeserializeSeed<'de> for DocumentDeserializer<'_> {
    type Value = Vec<Value>;

    fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let visitor = DocumentVisitor(&mut self.state);
        let documents = deserializer.deserialize_seq(visitor)?;
        Ok(documents)
    }
}

struct DocumentVisitor<'a>(&'a mut Stats);

impl<'de> serde::de::Visitor<'de> for DocumentVisitor<'_> {
    type Value = Vec<Value>;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        write!(formatter, "a list")
    }

    fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::SeqAccess<'de>,
    {
        let mut agg_map = serde_json::Map::new();
        while let Some(item) = seq.next_element()? {
            // If `item` isn't a JSON object, we'll skip it:
            let Value::Object(map) = item else { continue };
            // Get the first element, assuming we have some
            let (k, v) = match map.into_iter().next() {
                Some(kv) => kv,
                None => continue,
            };
            // Ignore any null values; aggregate everything into a single map
            if v == Value::Null {
                self.0.records_skipped += 1;
                continue;
            } else {
                println!("Keeping {k}={v}");
                agg_map.insert(k, v);
            }
        }
        let values = Value::Object(agg_map);
        println!("Final value is {values}");
        Ok(vec![values])
    }
}

fn main() {
    let fh = std::fs::File::open("input.json").unwrap();
    let buf = std::io::BufReader::new(fh);
    let read = serde_json::de::IoRead::new(buf);
    let mut state = Stats::default();
    let mut deserializer = serde_json::Deserializer::new(read);
    let mydata = MyDataDeserializer { state: &mut state }
        .deserialize(&mut deserializer)
        .unwrap();
    println!("{mydata:?}");
}
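This code runs successfully and correctly deserializes my input data. The problem is that I can't figure out how to stream the "documents" array one element at a time. I don't know how to turn the documents = map.next_value()?; line into something that passes the state down to DocumentDeserializer. It should maybe use something like:

let d = DocumentDeserializer { state: self.0 }
    .deserialize(&mut map)
    .unwrap();

but .deserialize expects a serde::Deserializer<'de>, whereas map is a serde::de::MapAccess<'de>.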
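In any case, this whole thing seems overly verbose, so if this isn't the generally accepted or idiomatic approach, I'm open to another one. As the OP of the linked question noted, all this boilerplate is off-putting.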