Published 2015-09-14 14:49:38 | Source: compiled from the web
New in version 2.1.
The aggregation framework provides a means to calculate aggregated values without having to use map-reduce. While map-reduce is powerful, it is often more difficult than necessary for simple aggregation tasks, such as totaling or averaging field values.
If you are familiar with SQL, the aggregation framework provides functionality similar to GROUP BY and related SQL operators, as well as simple forms of "self joins". Additionally, the aggregation framework provides projection capabilities to reshape the returned data. Using projections in the aggregation framework, you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top level of the results.
Conceptually, documents from a collection pass through an aggregation pipeline, which transforms these objects as they pass. For those familiar with UNIX-like shells (e.g. bash), the concept is analogous to the pipe (i.e. |) used to string together a series of filters.
In a shell environment, a pipe redirects the output of one process to the input of the next. The aggregation pipeline streams documents from one pipeline operator to the next in order to process them. Pipeline operators can be repeated within the pipeline.
All pipeline operators process a stream of documents: the pipeline behaves as if the operation scans a collection and passes all matching documents into the "top" of the pipeline, and each operator in the pipeline transforms each document as it passes through.
Note
A pipeline operator need not produce one output document for every input document: operators may also generate new documents or filter out documents.
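This behavior can be sketched in plain Python, with pipeline operators modeled as chained generators. The helper names `match` and `unwind` are illustrative only, not MongoDB APIs: a match-style stage may emit fewer documents than it receives, while an unwind-style stage may emit more.

```python
def match(docs, predicate):
    """Pass through only documents satisfying the predicate (0..1 outputs per input)."""
    for doc in docs:
        if predicate(doc):
            yield doc

def unwind(docs, field):
    """Emit one output document per element of an array field (0..n outputs per input)."""
    for doc in docs:
        for value in doc.get(field, []):
            yield {**doc, field: value}

docs = [
    {"author": "bob", "tags": ["fun", "good"]},
    {"author": "joe", "tags": []},
]

# Two input documents flow in; the match stage drops one, and the unwind
# stage expands the survivor into two output documents.
piped = list(unwind(match(docs, lambda d: d["author"] == "bob"), "tags"))
# piped == [{"author": "bob", "tags": "fun"}, {"author": "bob", "tags": "good"}]
```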
Warning
The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.
Invoke the aggregation framework by using the aggregate() wrapper in the mongo shell or the aggregate database command. You always call aggregate() on a collection object, which determines the input documents of the aggregation pipeline. The arguments to the aggregate() method specify a sequence of pipeline operators, where each pipeline operator may have a number of operands.
To begin, consider a collection of documents named articles with the following format:
{
  title : "this is my title" ,
  author : "bob" ,
  posted : new Date () ,
  pageViews : 5 ,
  tags : [ "fun" , "good" , "fun" ] ,
  comments : [
    { author : "joe" , text : "this is cool" } ,
    { author : "sam" , text : "this is bad" }
  ],
  other : { foo : 5 }
}
The following example aggregation operation pivots data to create a set of author names grouped by the tags applied to an article. Invoke the aggregation framework by issuing the following command:
db.articles.aggregate(
  { $project : {
      author : 1,
      tags : 1
  } },
  { $unwind : "$tags" },
  { $group : {
      _id : { tags : "$tags" },
      authors : { $addToSet : "$author" }
  } }
);
The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields.
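The effect of these three stages can be sketched in plain Python. This is an illustrative simulation of what the pipeline above computes on the sample document, not MongoDB code; the intermediate variable names are assumptions.

```python
article = {
    "title": "this is my title",
    "author": "bob",
    "pageViews": 5,
    "tags": ["fun", "good", "fun"],
}

# $project: keep only the author and tags fields.
projected = [{"author": a["author"], "tags": a["tags"]} for a in [article]]

# $unwind: emit one document per element of the tags array.
unwound = [{"author": d["author"], "tags": t}
           for d in projected for t in d["tags"]]

# $group with $addToSet: collect the distinct authors seen for each tag.
groups = {}
for d in unwound:
    groups.setdefault(d["tags"], set()).add(d["author"])

result = [{"_id": {"tags": tag}, "authors": sorted(authors)}
          for tag, authors in sorted(groups.items())]
# result == [{"_id": {"tags": "fun"}, "authors": ["bob"]},
#            {"_id": {"tags": "good"}, "authors": ["bob"]}]
```

Note that the duplicate "fun" tag produces only one group, and $addToSet keeps "bob" once per group, mirroring set semantics.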
Because you will always call aggregate on a collection object, which logically inserts the entire collection into the aggregation pipeline, you may want to optimize the operation by avoiding scanning the entire collection whenever possible.
Depending on the order in which they appear in the pipeline, aggregation operators can take advantage of indexes.
The following pipeline operators take advantage of an index when they occur at the beginning of the pipeline: $match, $sort, $limit, and $skip.
The above operators can also use an index when placed before the following aggregation operators: $project, $unwind, and $group.
If your aggregation operation requires only a subset of the data in a collection, use the $match operator to restrict which items go in to the top of the pipeline, as in a query. When placed early in a pipeline, these $match operations use suitable indexes to scan only the matching documents in a collection.
Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and can use an index.
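The equivalence can be illustrated with a small plain-Python model (not MongoDB code; the sample documents are made up): filtering and then sorting a stream of documents yields the same result as a single filtered, sorted query over the collection.

```python
collection = [
    {"author": "bob", "pageViews": 5},
    {"author": "joe", "pageViews": 9},
    {"author": "bob", "pageViews": 2},
]

# Pipeline form: a $match-like filter followed by a $sort-like sort.
matched = [d for d in collection if d["author"] == "bob"]
pipeline_result = sorted(matched, key=lambda d: d["pageViews"])

# Query form: a single find(...) with a sort applied.
query_result = sorted((d for d in collection if d["author"] == "bob"),
                      key=lambda d: d["pageViews"])

assert pipeline_result == query_result
```

Because the two forms are logically identical, the server can satisfy the pipeline version with the same index it would use for the query version.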
In future versions there may be an optimization phase in the pipeline that reorders the operations to increase performance without affecting the result. However, at this time place $match operators at the beginning of the pipeline when possible.
Certain pipeline operators require access to the entire input set before they can produce any output. For example, $sort must receive all of the input from the preceding pipeline operator before it can produce its first output document. The current implementation of $sort does not go to disk in these cases: in order to sort the contents of the pipeline, the entire input must fit in memory.
$group has similar characteristics: Before any $group passes its output along the pipeline, it must receive the entirety of its input. For the $group operator, this frequently does not require as much memory as $sort, because it only needs to retain one record for each unique key in the grouping specification.
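The memory difference can be sketched in plain Python (an illustrative model, not MongoDB internals): a streaming group-by keeps one accumulator per distinct key, whereas a sort must buffer every input document before emitting anything.

```python
def streaming_group_count(docs, key):
    """Keep one accumulator per distinct key, updated as documents stream in."""
    counts = {}
    for doc in docs:
        counts[doc[key]] = counts.get(doc[key], 0) + 1
    return counts

docs = [{"tag": t} for t in ["fun", "good", "fun", "fun"]]
counts = streaming_group_count(docs, "tag")
# counts == {"fun": 3, "good": 1}: only two accumulators are held for four
# input documents, while sorting the same stream would have to hold all four.
```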
The current implementation of the aggregation framework logs a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators produce an error if they consume 10% or more of the physical memory on the host.
Note
Changed in version 2.1.
Some aggregation operations using aggregate will cause mongos instances to require more CPU resources than in previous versions. This modified performance profile may dictate alternate architectural decisions if you use the aggregation framework extensively in a sharded environment.
The aggregation framework is compatible with sharded collections.
When operating on a sharded collection, the aggregation pipeline is split into two parts. The aggregation framework pushes all of the operators up to the first $group or $sort operation to each shard. [1] Then, a second pipeline on the mongos runs. This pipeline consists of the first $group or $sort and any remaining pipeline operators, and runs on the results received from the shards.
The $group operator brings in any “sub-totals” from the shards and combines them: in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; mongos combines these values and then divides.
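The $avg merge described above can be sketched in plain Python. The per-shard figures are invented for illustration: each shard reports a (total, count) pair, and the merging step sums both components before dividing.

```python
# Hypothetical sub-totals reported by two shards: (sum of values, document count).
shard_subtotals = [
    (30.0, 3),   # shard A
    (10.0, 2),   # shard B
]

total = sum(s for s, _ in shard_subtotals)
count = sum(c for _, c in shard_subtotals)
combined_avg = total / count
# combined_avg == 8.0 (40.0 / 5), the same as averaging over all documents at once
```

Merging (total, count) pairs rather than per-shard averages is what keeps the result exact: averaging the averages (10.0 and 5.0) would give the wrong answer.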
[1] If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.