Aggregation Examples

Published 2015-09-14 14:43:25 | 232 views | Comments: 0 | Source: compiled from the web

MongoDB provides flexible data aggregation functionality with the aggregate command.

This document provides a number of practical examples that display the capabilities of the aggregation framework. All examples use a publicly available data set of all zipcodes and populations in the United States.

Requirements

mongod and mongo, version 2.2 or later.

Aggregations Using the Zip Code Data Set

To run these examples you will need the zipcode data set, which is available at media.mongodb.org/zips.json. Use mongoimport to load this data set into your mongod instance.

Data Model

Each document in this collection has the following form:

{
  "_id": "10280",
  "city": "NEW YORK",
  "state": "NY",
  "pop": 5574,
  "loc": [
    -74.016323,
    40.710537
  ]
}

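In these documents:

- The _id field holds the zip code as a string.
- The city field holds the city name.
- The state field holds the two-letter state abbreviation.
- The pop field holds the population.
- The loc field holds the location as a longitude and latitude pair.

All of the following examples use the aggregate() helper in the mongo shell. aggregate() provides a wrapper around the aggregate database command. See the documentation for your driver for a more idiomatic interface for data aggregation operations.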

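States with Populations Over 10 Million

To return all states with a population of at least 10 million, use the following aggregation operation: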

db.zipcodes.aggregate( { $group :
                         { _id : "$state",
                           totalPop : { $sum : "$pop" } } },
                       { $match : {totalPop : { $gte : 10*1000*1000 } } } )

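Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators, which define the aggregation process.

In the above example, the pipeline passes all documents in the zipcodes collection through the following steps:

The $group operator collects all documents and creates a document for each state.

These new per-state documents have one field in addition to the _id field: totalPop, a generated field that uses the $sum operation to calculate the total of all pop fields in the source documents.

After the $group operation, the documents in the pipeline resemble the following: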
```
{
  "_id" : "AK",
  "totalPop" : 550043
}
```
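The $match operation filters these documents so that the only documents that remain are those where the value of totalPop is greater than or equal to 10 million.

The $match operation does not alter the documents, which have the same format as the documents output by $group.

This operation is equivalent to the following SQL: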

SELECT state, SUM(pop) AS totalPop
       FROM zipcodes
       GROUP BY state
       HAVING totalPop >= (10*1000*1000)
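
To make the two steps concrete without a running mongod, here is a minimal plain JavaScript sketch that mimics what $group and $match do on an in-memory array. The sample documents and the function names groupByState and matchLargeStates are hypothetical and exist only for illustration; this is not how the server implements the stages.

```javascript
// Hypothetical sample data shaped like the zipcodes documents (values are made up).
const sampleZipcodes = [
  { _id: "00001", city: "ALPHA", state: "AA", pop: 6000000 },
  { _id: "00002", city: "BETA", state: "AA", pop: 5000000 },
  { _id: "00003", city: "GAMMA", state: "BB", pop: 550043 }
];

// Mimics { $group : { _id : "$state", totalPop : { $sum : "$pop" } } }.
function groupByState(docs) {
  const totals = new Map();
  for (const doc of docs) {
    totals.set(doc.state, (totals.get(doc.state) || 0) + doc.pop);
  }
  return Array.from(totals, ([state, totalPop]) => ({ _id: state, totalPop }));
}

// Mimics { $match : { totalPop : { $gte : 10*1000*1000 } } }.
function matchLargeStates(docs) {
  return docs.filter((doc) => doc.totalPop >= 10 * 1000 * 1000);
}

const largeStates = matchLargeStates(groupByState(sampleZipcodes));
console.log(largeStates);
```

With this sample input, only the hypothetical state "AA" passes the filter, because its two zip codes sum to 11 million.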

Average City Population by State

To return the average city population for each state, use the following aggregation operation:

db.zipcodes.aggregate( { $group :
                         { _id : { state : "$state", city : "$city" },
                           pop : { $sum : "$pop" } } },
                       { $group :
                       { _id : "$_id.state",
                         avgCityPop : { $avg : "$pop" } } } )

Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators that define the aggregation process.

In the above example, the pipeline passes all documents in the zipcodes collection through the following steps:

The first $group operator collects all documents and creates a new document for every combination of the city and state fields in the source documents. A city can span several zip codes, so this stage sums pop across them.

After this stage in the pipeline, the documents resemble the following:
```
{
  "_id" : {
    "state" : "CO",
    "city" : "EDGEWATER"
  },
  "pop" : 13154
}
```
The second $group operator collects the documents by the state field and uses the $avg expression to compute a value for the avgCityPop field.

The final output of this aggregation operation resembles the following:

{
  "_id" : "MN",
  "avgCityPop" : 5335
}
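
The two $group stages can be sketched in plain JavaScript as follows. This is an illustration under stated assumptions: the sample documents are made up, and sumByCity and averageByState are hypothetical helper names, not MongoDB APIs.

```javascript
// Hypothetical sample data (values are made up). ALPHA spans two zip codes.
const sampleZipcodes = [
  { city: "ALPHA", state: "AA", pop: 100 },
  { city: "ALPHA", state: "AA", pop: 300 },
  { city: "BETA", state: "AA", pop: 200 },
  { city: "GAMMA", state: "BB", pop: 50 }
];

// Mimics the first $group: one document per (state, city) pair, with pop summed.
function sumByCity(docs) {
  const cities = new Map();
  for (const { state, city, pop } of docs) {
    const key = JSON.stringify([state, city]);
    const entry = cities.get(key) || { _id: { state, city }, pop: 0 };
    entry.pop += pop;
    cities.set(key, entry);
  }
  return [...cities.values()];
}

// Mimics the second $group: one document per state, averaging the per-city totals.
function averageByState(cityDocs) {
  const states = new Map();
  for (const { _id, pop } of cityDocs) {
    const entry = states.get(_id.state) || { sum: 0, count: 0 };
    entry.sum += pop;
    entry.count += 1;
    states.set(_id.state, entry);
  }
  return Array.from(states, ([state, { sum, count }]) => ({
    _id: state,
    avgCityPop: sum / count
  }));
}

const averages = averageByState(sumByCity(sampleZipcodes));
console.log(averages);
```

In the sample, ALPHA totals 400 and BETA totals 200, so the average for the hypothetical state "AA" is taken over two cities rather than three zip codes.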

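Largest and Smallest Cities by State

To return the largest and smallest cities by population for each state, use the following aggregation operation: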

db.zipcodes.aggregate( { $group:
                         { _id: { state: "$state", city: "$city" },
                           pop: { $sum: "$pop" } } },
                       { $sort: { pop: 1 } },
                       { $group:
                         { _id : "$_id.state",
                           biggestCity:  { $last: "$_id.city" },
                           biggestPop:   { $last: "$pop" },
                           smallestCity: { $first: "$_id.city" },
                           smallestPop:  { $first: "$pop" } } },

                       // the following $project is optional, and
                       // modifies the output format.

                       { $project:
                         { _id: 0,
                           state: "$_id",
                           biggestCity:  { name: "$biggestCity",  pop: "$biggestPop" },
                           smallestCity: { name: "$smallestCity", pop: "$smallestPop" } } } )

Aggregation operations using the aggregate() helper process all documents in the zipcodes collection. aggregate() connects a number of pipeline operators that define the aggregation process.

All documents from the zipcodes collection pass into the pipeline, which consists of the following steps:

The first $group operator collects all documents and creates a new document for every combination of the city and state fields in the source documents.

By specifying the value of _id as a sub-document that contains both fields, the operation preserves the state field for use later in the pipeline. The documents produced by this stage of the pipeline have a second field, pop, which uses the $sum operator to provide the total of the pop fields in the source documents.

At this stage in the pipeline, the documents resemble the following:
```
{
  "_id" : {
    "state" : "CO",
    "city" : "EDGEWATER"
  },
  "pop" : 13154
}
```
The $sort operator orders the documents in the pipeline by the value of the pop field, from smallest to largest. This operation does not alter the documents.

The second $group operator collects the documents in the pipeline by the state field, which is a field inside the nested _id document.

Within each per-state document this $group operator specifies four fields. Using the $last expression, the $group operator creates the biggestCity and biggestPop fields, which store the city with the largest population and that population. Using the $first expression, the $group operator creates the smallestCity and smallestPop fields, which store the city with the smallest population and that population. Both expressions depend on the ascending order established by the preceding $sort.

The documents at this stage in the pipeline resemble the following:
```
{
  "_id" : "WA",
  "biggestCity" : "SEATTLE",
  "biggestPop" : 520096,
  "smallestCity" : "BENGE",
  "smallestPop" : 2
}
```
The final operation is $project, which renames the _id field to state and moves the biggestCity, biggestPop, smallestCity, and smallestPop into biggestCity and smallestCity sub-documents.

The final output of this aggregation operation resembles the following:

{
  "state" : "RI",
  "biggestCity" : {
    "name" : "CRANSTON",
    "pop" : 176404
  },
  "smallestCity" : {
    "name" : "CLAYVILLE",
    "pop" : 45
  }
}
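
The interaction between the ascending $sort and the $first and $last expressions can be sketched in plain JavaScript. The per-city input documents and the function name biggestAndSmallest are hypothetical; the sketch folds the optional $project reshaping into the same loop for brevity.

```javascript
// Hypothetical per-city totals, shaped like the output of the first $group stage.
const cityTotals = [
  { _id: { state: "AA", city: "ALPHA" }, pop: 400 },
  { _id: { state: "AA", city: "BETA" }, pop: 200 },
  { _id: { state: "AA", city: "DELTA" }, pop: 900 },
  { _id: { state: "BB", city: "GAMMA" }, pop: 50 }
];

function biggestAndSmallest(cityDocs) {
  // Mimics { $sort: { pop: 1 } }: ascending order, smallest population first.
  const sorted = [...cityDocs].sort((a, b) => a.pop - b.pop);

  const states = new Map();
  for (const { _id, pop } of sorted) {
    const entry = states.get(_id.state);
    if (!entry) {
      // $first: the first document seen for a state is its smallest city.
      states.set(_id.state, {
        state: _id.state,
        biggestCity: { name: _id.city, pop },
        smallestCity: { name: _id.city, pop }
      });
    } else {
      // $last: each later document is larger, so the final one wins.
      entry.biggestCity = { name: _id.city, pop };
    }
  }
  return [...states.values()];
}

const extremes = biggestAndSmallest(cityTotals);
console.log(JSON.stringify(extremes, null, 2));
```

If the sort direction were reversed, the same $first and $last expressions would swap roles, which is why the stage order matters here.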

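Aggregation with User Preference Data

Data Model

Consider a hypothetical sports club with a database that contains a users collection, which tracks each user's join date and sport preferences and stores these data in documents that resemble the following: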

{
  _id : "jane",
  joined : ISODate("2011-03-02"),
  likes : ["golf", "racquetball"]
}
{
  _id : "joe",
  joined : ISODate("2012-07-02"),
  likes : ["tennis", "golf", "swimming"]
}

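Normalize and Sort Documents

The following operation returns user names in upper case, sorted alphabetically. The aggregation includes the user names from all documents in the users collection.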

db.users.aggregate(
  [
    { $project : { name:{$toUpper:"$_id"} , _id:0 } },
    { $sort : { name : 1 } }
  ]
)

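All documents from the users collection pass through the pipeline, which consists of the following operations:

The $project operator:
- creates a new field called name.
- converts the value of _id to upper case with the $toUpper operator, and stores the result in the new name field.
- suppresses the _id field. $project passes the _id field by default, unless explicitly suppressed.
The $sort operator orders the results by the name field.

The results of the aggregation resemble the following: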

{
  "name" : "JANE"
},
{
  "name" : "JILL"
},
{
  "name" : "JOE"
}
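
As a rough illustration of what this pipeline computes, the following plain JavaScript sketch applies the same projection and sort to an in-memory array. The function name upperCaseNames is hypothetical, and the sample array contains only the _id values needed for the example.

```javascript
// Sample users; only the _id field matters for this pipeline.
const sampleUsers = [{ _id: "joe" }, { _id: "jane" }, { _id: "jill" }];

function upperCaseNames(docs) {
  return docs
    // Mimics { $project : { name : { $toUpper : "$_id" }, _id : 0 } }.
    .map((doc) => ({ name: doc._id.toUpperCase() }))
    // Mimics { $sort : { name : 1 } }: ascending by name.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

const names = upperCaseNames(sampleUsers);
console.log(names);
```

Because the projection builds a new object with only the name field, the _id field is absent from the output, which corresponds to suppressing it with _id : 0.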

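Return Usernames Ordered by Join Month

The following aggregation operation returns user names sorted by the month they joined. This kind of aggregation could help generate membership renewal notices.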

db.users.aggregate(
  [
    { $project : { month_joined : {
                                    $month : "$joined"
                                  },
                   name : "$_id",
                   _id : 0
                 } },
    { $sort : { month_joined : 1 } }
  ]
)

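The pipeline passes all documents in the users collection through the following operations:

The $project operator:
- creates two new fields: month_joined and name.
- suppresses the _id field from the results. The aggregate() method includes the _id field unless explicitly suppressed.
The $month operator converts the value of the joined field to an integer representation of the month. Then the $project operator assigns that value to the month_joined field.
The $sort operator sorts the results by the month_joined field.

The operation returns results that resemble the following: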

{
  "month_joined" : 1,
  "name" : "ruth"
},
{
  "month_joined" : 1,
  "name" : "harold"
},
{
  "month_joined" : 1,
  "name" : "kate"
},
{
  "month_joined" : 2,
  "name" : "jill"
}
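
The month extraction and sort can be sketched in plain JavaScript. This is an approximation under stated assumptions: the join date for "ruth" is invented for the example, the function name namesByJoinMonth is hypothetical, and getUTCMonth() + 1 stands in for $month, which returns the month as a number from 1 to 12.

```javascript
// jane and joe use the dates from the data model; ruth's date is hypothetical.
const sampleUsers = [
  { _id: "joe", joined: new Date("2012-07-02") },
  { _id: "jane", joined: new Date("2011-03-02") },
  { _id: "ruth", joined: new Date("2012-01-14") }
];

function namesByJoinMonth(docs) {
  return docs
    // Mimics the $project stage: derive month_joined, copy _id into name, drop _id.
    .map((doc) => ({
      month_joined: doc.joined.getUTCMonth() + 1, // getUTCMonth() is zero-based
      name: doc._id
    }))
    // Mimics { $sort : { month_joined : 1 } }.
    .sort((a, b) => a.month_joined - b.month_joined);
}

const byMonth = namesByJoinMonth(sampleUsers);
console.log(byMonth);
```

Note that the sort considers only the month, so users who joined in January of different years end up next to each other.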

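Return Total Number of Joins per Month

The following operation shows how many people joined in each month of the year. You might use this aggregated data to inform recruiting and marketing strategies.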

db.users.aggregate(
  [
    { $project : { month_joined : { $month : "$joined" } } } ,
    { $group : { _id : {month_joined:"$month_joined"} , number : { $sum : 1 } } },
    { $sort : { "_id.month_joined" : 1 } }
  ]
)

The pipeline passes all documents in the users collection through the following operations:

The $project operator creates a new field called month_joined.
The $month operator converts the value of the joined field to an integer representation of the month. Then the $project operator assigns that value to the month_joined field.
The $group operator collects all documents with a given month_joined value and counts how many documents there are for that value. Specifically, for each unique value, $group creates a new “per-month” document with two fields:
- _id, which contains a nested document with the month_joined field and its value.
- number, which is a generated field. The $sum operator increments this field by 1 for every document containing the given month_joined value.
The $sort operator sorts the documents created by $group according to the contents of the month_joined field.

The operation returns results that resemble the following:

{
  "_id" : {
    "month_joined" : 1
  },
  "number" : 3
},
{
  "_id" : {
    "month_joined" : 2
  },
  "number" : 9
},
{
  "_id" : {
    "month_joined" : 3
  },
  "number" : 5
}
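
The counting behavior of { $sum : 1 } can be illustrated with a small plain JavaScript sketch. The join dates below are made up apart from those shown in the data model, and joinsPerMonth is a hypothetical helper name.

```javascript
// Hypothetical users; only the joined field matters for this pipeline.
const sampleUsers = [
  { _id: "jane", joined: new Date("2011-03-02") },
  { _id: "joe", joined: new Date("2012-07-02") },
  { _id: "ruth", joined: new Date("2012-01-14") },
  { _id: "kate", joined: new Date("2010-01-30") }
];

function joinsPerMonth(docs) {
  const counts = new Map();
  for (const doc of docs) {
    // Mimics $month inside $project.
    const month = doc.joined.getUTCMonth() + 1;
    // Mimics number : { $sum : 1 }: add 1 for every document in the group.
    counts.set(month, (counts.get(month) || 0) + 1);
  }
  return Array.from(counts, ([month, number]) => ({
    _id: { month_joined: month },
    number
  })).sort((a, b) => a._id.month_joined - b._id.month_joined); // { $sort : { "_id.month_joined" : 1 } }
}

const perMonth = joinsPerMonth(sampleUsers);
console.log(JSON.stringify(perMonth));
```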

Return the Five Most Common “Likes”

The following aggregation collects the top five most “liked” activities in the data set. An analysis like this might help inform planning and future development.

db.users.aggregate(
  [
    { $unwind : "$likes" },
    { $group : { _id : "$likes" , number : { $sum : 1 } } },
    { $sort : { number : -1 } },
    { $limit : 5 }
  ]
)

The pipeline passes all documents in the users collection through the following operations:

The $unwind operator separates each value in the likes array, and creates a new version of the source document for every element in the array.

Example

Given the following document from the users collection:

{
  _id : "jane",
  joined : ISODate("2011-03-02"),
  likes : ["golf", "racquetball"]
}

The $unwind operator would create the following documents:

{
  _id : "jane",
  joined : ISODate("2011-03-02"),
  likes : "golf"
}
{
  _id : "jane",
  joined : ISODate("2011-03-02"),
  likes : "racquetball"
}

The $group operator collects all documents with the same value for the likes field and counts each grouping. With this information, $group creates a new document with two fields:
- _id, which contains the likes value.
- number, which is a generated field. The $sum operator increments this field by 1 for every document containing the given likes value.
The $sort operator sorts these documents by the number field in reverse order.
The $limit operator only includes the first 5 result documents.

The operation returns results that resemble the following:

{
  "_id" : "golf",
  "number" : 33
},
{
  "_id" : "racquetball",
  "number" : 31
},
{
  "_id" : "swimming",
  "number" : 24
},
{
  "_id" : "handball",
  "number" : 19
},
{
  "_id" : "tennis",
  "number" : 18
}
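
The four stages can be sketched end to end in plain JavaScript using the two sample users from the data model. The function names unwindLikes and topLikes are hypothetical, and with only two users the counts are far smaller than in the sample output above.

```javascript
// The two sample users from the data model (joined dates omitted for brevity).
const sampleUsers = [
  { _id: "jane", likes: ["golf", "racquetball"] },
  { _id: "joe", likes: ["tennis", "golf", "swimming"] }
];

// Mimics { $unwind : "$likes" }: one copy of the document per array element.
function unwindLikes(docs) {
  return docs.flatMap((doc) => doc.likes.map((like) => ({ ...doc, likes: like })));
}

function topLikes(docs, limit) {
  const counts = new Map();
  for (const doc of unwindLikes(docs)) {
    // Mimics { $group : { _id : "$likes", number : { $sum : 1 } } }.
    counts.set(doc.likes, (counts.get(doc.likes) || 0) + 1);
  }
  return Array.from(counts, ([like, number]) => ({ _id: like, number }))
    .sort((a, b) => b.number - a.number) // { $sort : { number : -1 } }
    .slice(0, limit);                    // { $limit : 5 }
}

const unwound = unwindLikes(sampleUsers);
const top = topLikes(sampleUsers, 5);
console.log(top);
```

Unwinding first is what lets $group count individual activities; grouping on the original array field would instead treat each distinct array as one value.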
