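I have the following documents in my collection. Each document contains the text of a tweet and an array of entities extracted from the tweet (using AWS Comprehend):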
{
  "text" : "some tweet by John Smith in New York about Stack Overflow",
  "entities" : [
    {
      "Type" : "ORGANIZATION",
      "Text" : "stack overflow"
    },
    {
      "Type" : "LOCATION",
      "Text" : "new york"
    },
    {
      "Type" : "PERSON",
      "Text" : "john smith"
    }
  ]
},
{
  "text" : "another tweet by John Smith but this one from California and about Google",
  "entities" : [
    {
      "Type" : "ORGANIZATION",
      "Text" : "google"
    },
    {
      "Type" : "LOCATION",
      "Text" : "california"
    },
    {
      "Type" : "PERSON",
      "Text" : "john smith"
    }
  ]
}
I want to get a list of distinct entities.Text values, grouped by entities.Type, with the number of occurrences of each entities.Text, like this:
{ "_id" : "ORGANIZATION", "values" : [ {text:"stack overflow",count:1},{text:"google",count:1} ] }
{ "_id" : "LOCATION", "values" : [ {text:"new york",count:1},{text:"california",count:1} ] }
{ "_id" : "PERSON", "values" : [ {text:"john smith",count:2} ] }
I can group by entities.Type and push all of the entities.Text values into an array with this query:
db.collection.aggregate([
  {
    $unwind: '$entities'
  },
  {
    $group: {
      _id: '$entities.Type',
      values: {
        $push: '$entities.Text'
      }
    }
  }
])
This produces the following output, which contains duplicate values and no counts:
{ "_id" : "ORGANIZATION", "values" : [ "stack overflow", "google" ] }
{ "_id" : "LOCATION", "values" : [ "new york", "california" ] }
{ "_id" : "PERSON", "values" : [ "john smith", "john smith" ] }
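I started down the path of using $project as the last stage of the aggregation and adding a computed field, valuesMap, with a JavaScript function. But then I realized that you can't write JavaScript in an aggregation pipeline.
My next step would be to process the MongoDB output with plain JavaScript, but I'd like (for the sake of learning) to do all of this with a MongoDB query.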
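For reference, the plain JavaScript fallback might look something like the sketch below. The names `grouped` and `countValues` are made up for illustration, and `grouped` is assumed to hold the output of the $push query shown earlier.

```javascript
// `grouped` is a hypothetical variable holding the output of the $push query.
const grouped = [
  { _id: "ORGANIZATION", values: ["stack overflow", "google"] },
  { _id: "LOCATION", values: ["new york", "california"] },
  { _id: "PERSON", values: ["john smith", "john smith"] }
];

// Collapse each `values` array into a list of { text, count } entries.
function countValues(docs) {
  return docs.map(({ _id, values }) => {
    const counts = new Map(); // a Map keeps first-seen order of the texts
    for (const text of values) {
      counts.set(text, (counts.get(text) || 0) + 1);
    }
    return {
      _id,
      values: Array.from(counts, ([text, count]) => ({ text, count }))
    };
  });
}

const counted = countValues(grouped);
console.log(JSON.stringify(counted, null, 2));
```

This works, but it moves the counting to the client, which is exactly what I'd like to avoid.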
Thanks!
2 Answers
You can try the following query. You need an additional $group to push the counts and the texts.
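A minimal sketch of the two-$group idea, assuming the collection name `collection` from the question. The first $group counts each (Type, Text) pair, and the second collects those counted pairs under their Type. The variable name `pipeline` and the key names `type` and `text` inside `_id` are chosen here for illustration.

```javascript
// Sketch for the mongo shell / mongosh; `collection` is the placeholder name from the question.
const pipeline = [
  // Emit one document per entity.
  { $unwind: "$entities" },
  // First group: count the occurrences of each (Type, Text) pair.
  {
    $group: {
      _id: { type: "$entities.Type", text: "$entities.Text" },
      count: { $sum: 1 }
    }
  },
  // Second group: collect the counted texts under their Type.
  {
    $group: {
      _id: "$_id.type",
      values: { $push: { text: "$_id.text", count: "$count" } }
    }
  }
];

// Run the pipeline only where a shell `db` handle is available.
if (typeof db !== "undefined") {
  db.collection.aggregate(pipeline).forEach(printjson);
}
```

On the two sample documents this should give one document per Type, shaped like the desired output in the question. The order of the groups and of the entries inside `values` is not guaranteed unless you add a $sort stage.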