Week 5: Aggregation Framework

The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline

Let’s say there is a table

Name	Category	Manufacturer	Price
iPad	Tablet	Apple	499
S4	Cell Phone	Samsung	350

If I wanted to find out how many products from each manufacturer from each manufacturer, the way it is done in SQL is through a query :

Select manufacturer, count(*) from products group by manufacturer

We need to use Mongo aggregation framework to use similar to “group by“

use agg

db.products.aggregate([ {$group: { _id:”$manufacturer”,num_products:{$sum:1} }}])

Aggregation pipeline

Aggregation uses a pipeline in MongoDB. The concept of pipes is similar to unix. At the top is the collections. The documents are piped through the processing pipeline and they go through series of stages and will eventually get a result set. Each of the stage can happen multiple times.

Unwind denormalizes the data. For an array the command unwind will create a separate document for each key in the array with all other data being repeated in the document, thus creating redundant data.

In the above diagram

1:1 maps to same number of records

N:1 maps to only a subset of records returned

1:N represents a larger set of records returns due to unwind operation

Simple aggregation example expanded

If the above aggregation query, is run against a product collection, it

goes through each record looks for the manufacturer, if doesn’t exist, creates a record and adds the num_products value.

At the end of the iteration, a list of all the unique manufacturers and their respective number of products will be produced as a result set

Compound grouping

For compound grouping where traditionally we use queries such as

Select manufacturer, category, count(*) from products group by manufacturer, category

The below example groups by manufacturer and category

Using a document for _id

_id doesn’t always have to be a number or a string, the important aspect is that is has to be unique. It can also be a document.

Aggregate Expressions

The following are the different aggregation expressions

1. $sum – count and sum up the key

2. $avg - average

3. $min – minimum value of the key

4. $max – maximum value

5. $push – build arrays

6. $addToSet – add to set only adds uniquely

7. $first – after sorting the document produces the first document

8. $last – after sorting the document produces the last document

Using $sum

Using $avg

Using addToSet

Using $push

Difference between push and addToSet is that push doesn’t check for duplicates and it just adds the same. . addToSet adds by checking for duplicates

Using Max and min

Double Grouping

You can run more than one aggregation statement

Example:

Using $project

Project example

use agg

db.products.aggregate([

{$project:

{

_id:0,

'maker': {$toLower:"$manufacturer"},

'details': {'category': "$category",

'price' : {"$multiply":["$price",10]}

'item':'$name'

}

])

use agg

db.zips.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

Using $match

use agg

db.zips.aggregate([

    {$match:

         state:"NY"

},

    {$group:

         _id: "$city",

         population: {$sum:"$pop"},

         zip_codes: {$addToSet: "$_id"}

},

    {$project:

         _id: 0,

         city: "$_id",

         population: 1,

         zip_codes:1

])

use agg

db.zips.aggregate([

    {$match:

         state:"NY"

},

    {$group:

         _id: "$city",

         population: {$sum:"$pop"},

         zip_codes: {$addToSet: "$_id"}

])

Using $sort

Sort happens in memory and hence can hog memory

If the sort is before grouping and after match, it can use index

If the sort is after grouping it cannot use index

use agg

db.zips.aggregate([

    {$match:

         state:"NY"

},

    {$group:

         _id: "$city",

         population: {$sum:"$pop"},

},

    {$project:

         _id: 0,

         city: "$_id",

         population: 1,

},

    {$sort:

         population:-1

])

$limit and $skip

use agg

db.zips.aggregate([

    {$match:

         state:"NY"

},

    {$group:

         _id: "$city",

         population: {$sum:"$pop"},

},

    {$project:

         _id: 0,

         city: "$_id",

         population: 1,

},

    {$sort:

         population:-1

},

    {$skip: 10},

    {$limit: 5}

])

Using $unwind

db.posts.aggregate([

    /* unwind by tags */

    {"$unwind":"$tags"},

    /* now group by tags, counting each tag */

    {"$group":

     {"_id":"$tags",

      "count":{$sum:1}

},

    /* sort by popularity */

    {"$sort":{"count":-1}},

    /* show me the top 10 */

    {"$limit": 10},

    /* change the name of _id to be tag */

    {"$project":

     {_id:0,

      'tag':'$_id',

      'count' : 1

])

db.posts.aggregate([{"$unwind":"$comments"},{$group:{"_id":{"author":"$comments.author"},count:{"$sum":1}

}},

{$sort:

{

count:-1

}

{$limit: 1}

])

Some examples:

Avg score homework 5.3

db.grades.aggregate([ {$unwind:'$scores'},{$match:{'scores.type':{$in:['exam','homework']}}},{$group:{_id:

{"studentId":'$student_id',"classId":"$class_id"},Avgscore:{$avg:'$scores.score'}}},{$group:

{_id:"$_id.classId","Avgclassscore":{"$avg":"$Avgscore"}}},

{$sort:

{

Avgclassscore:-1

}

])

SQL to Aggregation Mapping

http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/

Limitations to aggregation framework

1. The result set is limited to 16MB of memory

2. You cannot use more than 10% of memory on a machine

3. Sharding: Aggregation does work on a sharded environment, but after the first $group or $sort phase, the aggregation has to be brought back to the MongoS

Alternates of aggregation framework

1. MapReduce

2. Hadoop

Please Note : This is a series of 6

Reference: All the material credit goes to the course hosted by Mongo

Prashanth B Panduranga

Friday, May 23, 2014

Mongo Learning Series 5