
MongoDB: get distinct elements from an array WITH the number of occurrences of each element

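I have the following documents in my collection. Each document contains the text of a tweet and an array of entities extracted from that tweet (using AWS Comprehend):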

{
"text" : "some tweet by John Smith in New York about Stack Overflow",
"entities" : [
    {
        "Type" : "ORGANIZATION",
        "Text" : "stack overflow"
    },
    {
        "Type" : "LOCATION",
        "Text" : "new york"
    },
    {
        "Type" : "PERSON",
        "Text" : "john smith"
    }
  ]
},
{
    "text" : "another tweet by John Smith but this one from California and about Google",
    "entities" : [
    {
        "Type" : "ORGANIZATION",
        "Text" : "google"
    },
    {
        "Type" : "LOCATION",
        "Text" : "california"
    },
    {
        "Type" : "PERSON",
        "Text" : "john smith"
    }
  ]
}

I would like to get a list of distinct entities.Text values, grouped by entities.Type, with the number of occurrences of each entities.Text, like this:

{ "_id" : "ORGANIZATION", "values" : [ {text:"stack overflow",count:1},{text:"google",count:1} ] }
{ "_id" : "LOCATION", "values" : [ {text:"new york",count:1},{text:"california",count:1} ] }
{ "_id" : "PERSON", "values" : [ {text:"john smith",count:2} ] }

I can group by entities.Type and put all of the entities.Text values into an array with this query:

db.collection.aggregate([
    {
        $unwind: '$entities'
    },
    {
        $group: {
            _id: '$entities.Type',
            values: {
                $push: '$entities.Text'
            }
        }
    }
])

This produces the following output, which contains duplicate values and no counts:

{ "_id" : "ORGANIZATION", "values" : [ "stack overflow", "google" ] }
{ "_id" : "LOCATION", "values" : [ "new york", "california" ] }
{ "_id" : "PERSON", "values" : [ "john smith", "john smith" ] }

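I started down the path of using $project as the last step of the aggregation to add a computed field, valuesMap, built by a JavaScript function. But then I realized that you can't write JavaScript in the aggregation pipeline.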

My next step would be to process the MongoDB output with plain JavaScript, but I would like (for the sake of learning) to do all of this with a MongoDB query.
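
For reference, the plain JavaScript fallback mentioned above might look roughly like the sketch below. It assumes the $push output shown earlier has already been fetched into an array; the helper name countValues and the variable names are made up for illustration and are not part of any MongoDB API.

```javascript
// Hypothetical helper: turns the $push output (arrays with duplicates)
// into the desired { text, count } shape on the client side.
function countValues(groups) {
  return groups.map(({ _id, values }) => {
    const counts = new Map(); // a Map keeps first-seen insertion order
    for (const text of values) {
      counts.set(text, (counts.get(text) || 0) + 1);
    }
    return {
      _id,
      values: Array.from(counts, ([text, count]) => ({ text, count })),
    };
  });
}

// Sample input copied from the aggregation output shown above.
const pushed = [
  { _id: "ORGANIZATION", values: ["stack overflow", "google"] },
  { _id: "LOCATION", values: ["new york", "california"] },
  { _id: "PERSON", values: ["john smith", "john smith"] },
];

const counted = countValues(pushed);
console.log(JSON.stringify(counted, null, 2));
```

This works, but it moves the counting out of the database, which is exactly what the question is trying to avoid.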

Thanks!

2 Answers

  • 4

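    You can try the query below. You need an extra $group: the first one counts each distinct (type, text) pair, and the second one pushes the text and its count into an array per type.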

    db.collection.aggregate(
    [
      {"$unwind":"$entities"},
      {"$group":{
        "_id":{"type":"$entities.Type","text":"$entities.Text"},
        "count":{"$sum":1}
      }},
      {"$group":{
        "_id":"$_id.type",
        "values":{"$push":{"text":"$_id.text","count":"$count"}}
      }}
    ])
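
To see why two $group stages are needed, it can help to trace the same steps outside the database. The sketch below is a plain JavaScript emulation of the pipeline above, run on the sample documents from the question; it is not MongoDB code, and the variable names are chosen only for illustration.

```javascript
// Sample documents from the question, reduced to the entities array.
const docs = [
  { entities: [
    { Type: "ORGANIZATION", Text: "stack overflow" },
    { Type: "LOCATION", Text: "new york" },
    { Type: "PERSON", Text: "john smith" },
  ] },
  { entities: [
    { Type: "ORGANIZATION", Text: "google" },
    { Type: "LOCATION", Text: "california" },
    { Type: "PERSON", Text: "john smith" },
  ] },
];

// Emulates $unwind: one record per entity.
const unwound = docs.flatMap((doc) => doc.entities);

// Emulates the first $group: count each distinct (Type, Text) pair.
const pairCounts = new Map();
for (const { Type, Text } of unwound) {
  const key = JSON.stringify([Type, Text]);
  const entry = pairCounts.get(key) || { type: Type, text: Text, count: 0 };
  entry.count += 1;
  pairCounts.set(key, entry);
}

// Emulates the second $group: regroup by Type and push { text, count }.
const byType = new Map();
for (const { type, text, count } of pairCounts.values()) {
  if (!byType.has(type)) byType.set(type, []);
  byType.get(type).push({ text, count });
}

const result = Array.from(byType, ([_id, values]) => ({ _id, values }));
console.log(JSON.stringify(result, null, 2));
```

The emulation returns groups in first-seen order because a Map preserves insertion order. MongoDB's $group makes no such ordering promise, so add a $sort stage if the order of the results matters.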
    
  • 0
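    db.collection.aggregate([
        // Stage 1: emit one document per entity
        {
            $unwind: {
                path: '$entities'
            }
        },

        // Stage 2: group by Text alone, count occurrences, and collect
        // the distinct Types seen for that text into an array
        {
            $group: {
                _id: {
                    Text: '$entities.Text'
                },
                count: {
                    $sum: 1
                },
                Type: {
                    $addToSet: '$entities.Type'
                }
            }
        },

        // Stage 3: regroup by the Type array from stage 2
        {
            $group: {
                _id: {
                    Type: '$Type'
                },
                values: {
                    $addToSet: {
                        text: '$_id.Text',
                        count: '$count'
                    }
                }
            }
        },

        // Stage 4: unwrap the Type array so _id is a plain string.
        // Only the first element is kept, so this assumes each text
        // appears under a single Type.
        {
            $project: {
                values: 1,
                _id: {
                    $arrayElemAt: ['$_id.Type', 0]
                }
            }
        }
    ]);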
    
