MongoDB aggregate $lookup: Total size of documents in matching pipeline exceeds maximum document size

Posted on August 17

I have a pretty simple $lookup aggregation query like the following:

{'$lookup':
 {'from': 'edge',
  'localField': 'gid',
  'foreignField': 'to',
  'as': 'from'}}

When I run this on a match with enough documents, I get the following error:

Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server

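Every attempt to limit the number of documents fails. allowDiskUse: true does nothing. Passing a cursor option does nothing. Adding a $limit to the aggregation also fails.

How is this possible?

Then I looked at the error again. Where do the $match, $and and $eq come from? Is the aggregation pipeline, behind the scenes, farming out the $lookup call to another aggregation, one it runs on its own and for which I cannot supply limits or use cursors?

What is going on here?

Recommended answer

As stated earlier, the error occurs because $lookup by default produces a target "array" within the parent document from the results of the foreign collection, and the total size of the documents selected for that array causes the parent to exceed the 16MB BSON Limit.

The counter to this is to add an $unwind that immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup: instead of producing an array in the parent, the result is a "copy" of each parent for every document matched.

This is much like regular usage of $unwind, except that instead of being processed as a "separate" pipeline stage, the unwinding action is added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which creates a matching argument that is likewise added to the $lookup. You can see this in the explain output for the pipeline.

The topic is actually covered (briefly) in the Aggregation Pipeline Optimization section of the core documentation: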

$lookup + $unwind Coalescence

New in version 3.2.

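When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.

This is best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. It is kept as brief as possible while both breaking and working around the BSON Limit: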

const MongoClient = require('mongodb').MongoClient;

const uri = 'mongodb://localhost/test';

function data(data) {
  console.log(JSON.stringify(data, undefined, 2))
}

(async function() {

  let db;

  try {
    db = await MongoClient.connect(uri);

    console.log('Cleaning....');
    // Clean data
    await Promise.all(
      ["source","edge"].map(c => db.collection(c).remove() )
    );

    console.log('Inserting...')

    await db.collection('edge').insertMany(
      Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
    );
    await db.collection('source').insert({ _id: 1 })

    console.log('Fattening up....');
    await db.collection('edge').updateMany(
      {},
      { $set: { data: "x".repeat(100000) } }
    );

    // The full pipeline. Failing test uses only the $lookup stage
    let pipeline = [
      { $lookup: {
        from: 'edge',
        localField: '_id',
        foreignField: 'gid',
        as: 'results'
      }},
      { $unwind: '$results' },
      { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
      { $project: { 'results.data': 0 } },
      { $group: { _id: '$_id', results: { $push: '$results' } } }
    ];

    // List and iterate each test case
    let tests = [
      'Failing.. Size exceeded...',
      'Working.. Applied $unwind...',
      'Explain output...'
    ];

    for (let [idx, test] of Object.entries(tests)) {
      console.log(test);

      try {
        let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
            options = (( +idx === tests.length-1 ) ? { explain: true } : {});

        await new Promise((end,error) => {
          let cursor = db.collection('source').aggregate(currpipe,options);
          for ( let [key, value] of Object.entries({ error, end, data }) )
            cursor.on(key,value);
        });
      } catch(e) {
        console.error(e);
      }

    }

  } catch(e) {
    console.error(e);
  } finally {
    // Guard against a failed connect, where db is still undefined
    if (db) db.close();
  }

})();

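After inserting some initial data, the listing attempts to run an aggregate consisting of the $lookup stage alone, which fails with the following error:

{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size

This is basically telling you that the BSON limit was exceeded on retrieval.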
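
As a rough back-of-the-envelope check of why the first test must fail, the numbers from the listing can be plugged into a small sketch. The helper name `estimateLookupArrayBytes` is made up for illustration, and the estimate counts only the "data" string payload, ignoring BSON overhead, so it is a lower bound rather than an exact size:

```javascript
// MongoDB's maximum BSON document size: 16 MiB.
const BSON_MAX_BYTES = 16 * 1024 * 1024;

// Hypothetical helper: lower bound on the size of the array that $lookup
// would build. Only the payload string is counted, not BSON overhead.
function estimateLookupArrayBytes(docCount, payloadBytesPerDoc) {
  return docCount * payloadBytesPerDoc;
}

// Values from the listing: 1000 edge documents, each with "x".repeat(100000).
const estimate = estimateLookupArrayBytes(1000, 100000);

console.log(estimate);                  // 100000000
console.log(estimate > BSON_MAX_BYTES); // true
```

Roughly 100 MB of matched documents would have to fit inside a single parent document, which is far beyond the limit.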

By contrast, the next attempt adds the $unwind and $match pipeline stages.

The Explain output:

  {
    "$lookup": {
      "from": "edge",
      "as": "results",
      "localField": "_id",
      "foreignField": "gid",
      "unwinding": {                        // $unwind now is unwinding
        "preserveNullAndEmptyArrays": false
      },
      "matching": {                         // $match now is matching
        "$and": [                           // and actually executed against 
          {                                 // the foreign collection
            "_id": {
              "$gte": 1
            }
          },
          {
            "_id": {
              "$lte": 5
            }
          }
        ]
      }
    }
  },
  // $unwind and $match stages removed
  {
    "$project": {
      "results": {
        "data": false
      }
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "results": {
        "$push": "$results"
      }
    }
  }

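That result of course succeeds, because once the results are no longer placed into the parent document, the BSON limit cannot be exceeded.

This really happens as a result of adding $unwind alone. The $match is included to show that it is also folded into the $lookup stage, and that the overall effect is to "limit" the results returned in an efficient way: everything is done within the $lookup operation, and no results other than those matching are actually returned.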
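
To make the folding more concrete, here is a toy model of the rewrite, written purely for illustration. It is not MongoDB's optimizer code: the function name `coalesce` is invented, it only handles the string form of $unwind and a $match whose keys all target the "as" field, and unlike the real explain output it does not split range conditions into an $and.

```javascript
// Toy model only: folds a $unwind (and a following $match on the "as" field)
// into the preceding $lookup, loosely mirroring what explain reports.
function coalesce(pipeline) {
  const out = [];
  for (const stage of pipeline) {
    const prev = out[out.length - 1];
    const lookup = prev && prev.$lookup;

    // A $unwind on the lookup's "as" field becomes "unwinding"
    if (lookup && !lookup.unwinding && stage.$unwind === '$' + lookup.as) {
      lookup.unwinding = { preserveNullAndEmptyArrays: false };
      continue;
    }

    // A $match whose keys all start with "<as>." becomes "matching"
    if (lookup && lookup.unwinding && !lookup.matching && stage.$match) {
      const prefix = lookup.as + '.';
      const keys = Object.keys(stage.$match);
      if (keys.length > 0 && keys.every(k => k.startsWith(prefix))) {
        lookup.matching = {};
        for (const k of keys) {
          lookup.matching[k.slice(prefix.length)] = stage.$match[k];
        }
        continue;
      }
    }

    // Any other stage is deep-copied through unchanged
    out.push(JSON.parse(JSON.stringify(stage)));
  }
  return out;
}

const folded = coalesce([
  { $lookup: { from: 'edge', localField: '_id', foreignField: 'gid', as: 'results' } },
  { $unwind: '$results' },
  { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
  { $project: { 'results.data': 0 } }
]);

console.log(folded.length); // 2
console.log(JSON.stringify(folded[0].$lookup, null, 2));
```

The four input stages collapse to two, with the $unwind and $match absorbed into the $lookup, which is the same shape the explain output shows.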

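By constructing the pipeline this way, you can query for "referenced data" that would exceed the BSON limit and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" that $lookup actually performs.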

MongoDB 3.6 and Above - Additional for "LEFT JOIN"

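As all the content above notes, the BSON Limit is a "hard" limit that you cannot breach, and this is why the $unwind is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind, since it cannot preserve parents that have no matches. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the intact array, causing the same BSON Limit problem.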
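
The LEFT JOIN versus INNER JOIN difference can be seen without a server in a small in-memory sketch. The `lookup` and `unwind` functions below are simplified stand-ins written for this illustration, not the server's implementation, and the sample data is hypothetical:

```javascript
// In-memory stand-ins for the two collections (hypothetical data).
const source = [{ _id: 1 }, { _id: 2 }];
const edge = [{ _id: 10, gid: 1 }, { _id: 11, gid: 1 }];

// Simplified $lookup: attach every foreign document whose foreignField
// equals the parent's localField, as an array stored under "as".
function lookup(parents, foreign, localField, foreignField, as) {
  return parents.map(p => ({
    ...p,
    [as]: foreign.filter(f => f[foreignField] === p[localField])
  }));
}

// Simplified $unwind without preserveNullAndEmptyArrays: one output
// document per array element, so parents with an empty array disappear.
function unwind(docs, field) {
  return docs.flatMap(d => d[field].map(item => ({ ...d, [field]: item })));
}

const joined = lookup(source, edge, '_id', 'gid', 'results');
const unwound = unwind(joined, 'results');

console.log(joined.map(d => d._id));  // [ 1, 2 ]
console.log(unwound.map(d => d._id)); // [ 1, 1 ]
```

The parent with `_id: 2` survives the plain lookup with an empty array ("LEFT JOIN") but is dropped after unwinding ("INNER JOIN").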

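MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not itself breach the limit, it is possible to put conditions in that pipeline and have the array returned "intact", possibly with no matches at all, as would be indicative of a "LEFT JOIN".

The new expression would then be: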

{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": {
      "_id": { "$gte": 1, "$lte": 5 },
      "$expr": { "$eq": [ "$$gid", "$to" ] }
    }}          
  ],
  "as": "from"
}}

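In fact this is basically what MongoDB does "under the covers" with the previous syntax, since 3.6 uses $expr "internally" to construct the statement. The difference of course is that no "unwinding" option is present in how the $lookup actually gets executed.

If no documents are produced by the "pipeline" expression, then the target array within the master document will in fact be empty, just as with a "LEFT JOIN", which is the normal behavior of $lookup without any other options.

However, the output array MUST NOT cause the document where it is being created to exceed the BSON Limit. So it really is up to you to ensure that any content "matching" the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".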
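
One way to help keep the output array small is to trim and cap it inside the sub-pipeline itself. The following is only a sketch under assumptions not stated in the answer: the helper `boundedLookup`, the variable name `localValue`, the excluded field `data` and the cap of 100 are all hypothetical, and a $limit silently discards matches, so it only suits cases where a truncated array is acceptable.

```javascript
// Hypothetical helper: builds a MongoDB 3.6+ $lookup stage whose
// sub-pipeline filters, trims and caps the joined documents.
function boundedLookup({ from, localField, foreignField, as, exclude = [], limit }) {
  const pipeline = [
    // Correlate on the variable declared in "let"
    { $match: { $expr: { $eq: ['$$localValue', '$' + foreignField] } } }
  ];
  if (exclude.length > 0) {
    // Drop bulky fields before they reach the output array
    pipeline.push({ $project: Object.fromEntries(exclude.map(f => [f, 0])) });
  }
  if (Number.isInteger(limit) && limit > 0) {
    // Cap the number of joined documents (this discards further matches)
    pipeline.push({ $limit: limit });
  }
  return { $lookup: { from, let: { localValue: '$' + localField }, pipeline, as } };
}

const stage = boundedLookup({
  from: 'edge', localField: 'gid', foreignField: 'to', as: 'from',
  exclude: ['data'], limit: 100
});

console.log(JSON.stringify(stage, null, 2));
```

The resulting stage can be placed in an aggregation pipeline in the same position as the expression shown earlier. Whether it actually stays under the limit still depends on the size of the documents that remain after projection.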
