Friday, May 23, 2014

Mongo Learning Series 5

Week 5: Aggregation Framework


The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline
Let’s say there is a table
Name
Category
Manufacturer
Price
iPad
Tablet
Apple
499
S4
Cell Phone
Samsung
350

If I wanted to find out how many products from each manufacturer from each manufacturer, the way it is done in SQL is through a query :
Select manufacturer, count(*) from products group by manufacturer
We need to use Mongo aggregation framework to use similar to “group by“
use agg
db.products.aggregate([ {$group: { _id:”$manufacturer”,num_products:{$sum:1} }}])

Aggregation pipeline

 Aggregation uses a pipeline in MongoDB.  The concept of pipes is similar to unix. At the top is the collections. The documents are piped through the processing pipeline and they go through series of stages and will eventually get a result set.  Each of the stage can happen multiple times.




Unwind denormalizes the data. For an array the command unwind will create a separate document for each key in the array with all other data being repeated in the document, thus creating redundant data.
In the above diagram
1:1 maps to same number of records
N:1 maps to only a subset of records returned
1:N represents a larger set of records returns due to unwind operation

Simple aggregation example expanded


If the above aggregation query, is run against a product collection, it 
goes through each record looks for the manufacturer, if doesn’t exist, creates a record and adds the num_products value.


At the end of the iteration, a list of all the unique manufacturers and their respective number of products will be produced as a result set

Compound grouping
For compound grouping where traditionally we use queries such as
Select manufacturer, category, count(*) from products group by manufacturer, category
The below example groups by manufacturer and category


 
Using a document for _id
_id doesn’t always have to be a number or a string, the important aspect is that is has to be unique. It can also be a document.


Aggregate Expressions
The following are the different aggregation expressions
1.       $sum – count and sum up the key
2.       $avg - average
3.       $min – minimum value of the key
4.       $max – maximum value
5.       $push – build arrays
6.       $addToSet – add to set only adds uniquely
7.       $first – after sorting the document produces the first document
8.       $last – after sorting the document produces the last document

Using $sum




Using  $avg


Using addToSet

Using $push
Difference between push and addToSet is that push doesn’t check for duplicates and it just adds the same. . addToSet adds by checking for duplicates


Using Max and min

Double Grouping
You can run more than one aggregation statement

Example:


Using $project


Project example
use agg
db.products.aggregate([
    {$project:
     {
         _id:0,
         'maker': {$toLower:"$manufacturer"},
         'details': {'category': "$category",
                    'price' : {"$multiply":["$price",10]}
                   },
         'item':'$name'
     }
    }
])

use agg
db.zips.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

Using $match


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
         zip_codes:1
     }
    }
     
])

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    }
])


Using $sort

Sort happens in memory and hence can hog memory
If the sort is before grouping and after match, it can use index
If the sort is after grouping it cannot use index

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    }
      
    
     
])
 

$limit and $skip


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])



Using $unwind


db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
    ])



db.posts.aggregate([{"$unwind":"$comments"},{$group:{"_id":{"author":"$comments.author"},count:{"$sum":1}
}},
{$sort:
     {
                 count:-1
     }
    }
{$limit: 1}
])
Some examples:

Avg score homework 5.3
db.grades.aggregate([ {$unwind:'$scores'},{$match:{'scores.type':{$in:['exam','homework']}}},{$group:{_id:
{"studentId":'$student_id',"classId":"$class_id"},Avgscore:{$avg:'$scores.score'}}},{$group:
{_id:"$_id.classId","Avgclassscore":{"$avg":"$Avgscore"}}},
{$sort:
     {
                 Avgclassscore:-1
     }
    }

])

SQL to Aggregation Mapping



Limitations to aggregation framework

1.       The result set is limited to 16MB of memory
2.       You cannot use more than 10% of memory on a machine
3.       Sharding: Aggregation does work on a sharded environment, but after the first $group or $sort phase, the aggregation has to be brought back to the MongoS



Alternates of aggregation framework
1.       MapReduce
2.       Hadoop



Please Note : This is a series of 6 
Reference: All the material credit goes to the course hosted by Mongo

No comments: