Skip to main content

Mongo Learning Series 5

Week 5: Aggregation Framework


The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline
Let’s say there is a table
Name
Category
Manufacturer
Price
iPad
Tablet
Apple
499
S4
Cell Phone
Samsung
350

If I wanted to find out how many products from each manufacturer from each manufacturer, the way it is done in SQL is through a query :
Select manufacturer, count(*) from products group by manufacturer
We need to use Mongo aggregation framework to use similar to “group by“
use agg
db.products.aggregate([ {$group: { _id:”$manufacturer”,num_products:{$sum:1} }}])

Aggregation pipeline

 Aggregation uses a pipeline in MongoDB.  The concept of pipes is similar to unix. At the top is the collections. The documents are piped through the processing pipeline and they go through series of stages and will eventually get a result set.  Each of the stage can happen multiple times.




Unwind denormalizes the data. For an array the command unwind will create a separate document for each key in the array with all other data being repeated in the document, thus creating redundant data.
In the above diagram
1:1 maps to same number of records
N:1 maps to only a subset of records returned
1:N represents a larger set of records returns due to unwind operation

Simple aggregation example expanded


If the above aggregation query, is run against a product collection, it 
goes through each record looks for the manufacturer, if doesn’t exist, creates a record and adds the num_products value.


At the end of the iteration, a list of all the unique manufacturers and their respective number of products will be produced as a result set

Compound grouping
For compound grouping where traditionally we use queries such as
Select manufacturer, category, count(*) from products group by manufacturer, category
The below example groups by manufacturer and category


 
Using a document for _id
_id doesn’t always have to be a number or a string, the important aspect is that is has to be unique. It can also be a document.


Aggregate Expressions
The following are the different aggregation expressions
1.       $sum – count and sum up the key
2.       $avg - average
3.       $min – minimum value of the key
4.       $max – maximum value
5.       $push – build arrays
6.       $addToSet – add to set only adds uniquely
7.       $first – after sorting the document produces the first document
8.       $last – after sorting the document produces the last document

Using $sum




Using  $avg


Using addToSet

Using $push
Difference between push and addToSet is that push doesn’t check for duplicates and it just adds the same. . addToSet adds by checking for duplicates


Using Max and min

Double Grouping
You can run more than one aggregation statement

Example:


Using $project


Project example
use agg
db.products.aggregate([
    {$project:
     {
         _id:0,
         'maker': {$toLower:"$manufacturer"},
         'details': {'category': "$category",
                    'price' : {"$multiply":["$price",10]}
                   },
         'item':'$name'
     }
    }
])

use agg
db.zips.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

Using $match


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
         zip_codes:1
     }
    }
     
])

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    }
])


Using $sort

Sort happens in memory and hence can hog memory
If the sort is before grouping and after match, it can use index
If the sort is after grouping it cannot use index

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    }
      
    
     
])
 

$limit and $skip


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])



Using $unwind


db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
    ])



db.posts.aggregate([{"$unwind":"$comments"},{$group:{"_id":{"author":"$comments.author"},count:{"$sum":1}
}},
{$sort:
     {
                 count:-1
     }
    }
{$limit: 1}
])
Some examples:

Avg score homework 5.3
db.grades.aggregate([ {$unwind:'$scores'},{$match:{'scores.type':{$in:['exam','homework']}}},{$group:{_id:
{"studentId":'$student_id',"classId":"$class_id"},Avgscore:{$avg:'$scores.score'}}},{$group:
{_id:"$_id.classId","Avgclassscore":{"$avg":"$Avgscore"}}},
{$sort:
     {
                 Avgclassscore:-1
     }
    }

])

SQL to Aggregation Mapping



Limitations to aggregation framework

1.       The result set is limited to 16MB of memory
2.       You cannot use more than 10% of memory on a machine
3.       Sharding: Aggregation does work on a sharded environment, but after the first $group or $sort phase, the aggregation has to be brought back to the MongoS



Alternates of aggregation framework
1.       MapReduce
2.       Hadoop



Please Note : This is a series of 6 
Reference: All the material credit goes to the course hosted by Mongo

Comments

Popular posts from this blog

Mongo Learning Series 1

Mongo Learning First of all, I want to thank and congratulate the MongoDB team for hosting such a wonderful introductory interactive course.  Good job guys. For those interested here is the url https://education.mongodb.com/ It is a 7 week course. The syllabus follows: Week 1: Introduction Introduction & Overview - Overview, Design Goals, the Mongo Shell, JSON Intro, installing tools, overview of blog project. Bottle, Pymongo Week 2: CRUD CRUD (Creating, Reading and Updating Data) - Mongo shell, query operators, update operators and a few commands Week 3: Schema Design Schema Design - Patterns, case studies and tradeoffs Week 4: Performance Using indexes, monitoring and understanding performance. Performance in sharded environments. Week 5: Aggregation Framework Goals, the use of the pipeline...

TechSharp [T#] Going beyond Architecture Center of Excellence

Consulting Services companies goes through multitude of challenges in its Sales cycle, Delivery Cycle and over all Competency building and maintaining cycle. In this 2 part blog, I write about the various issues, Well whats the point in discussing problems with out a solution, Worry Not, The blog culminates with a tried and tested solution. Tried Architecture as Shared Services? Felt like Abstracting the best of the resources, while encapsulating them well within at the same time? Tried creating COE’s?  Have the management shot back stating it is overused/abused concept, tried and failed? Yes there are lot of reasons to fail when NOT done right. This blog entry documents the RIGHT way, tried and tested Recursively. Introduction What is TechSharp? Why do we need it? Is it an Architecture center of excellence? What is its significance? What issues does it resolve or even prevent from happening in the first place? Who benefits from it? What are the levels of Archi...
Continued . . Part 1 How it all fits together [Solution] Technology positioning system’s [TPS] output forms the key to the organizations focus, TechSharp will work on the trending Languages, Frameworks, Techniques, Tools, and Platforms across domains in the IT industry, and identify the right ones for the organization to adopt. This data will be used to propose and implement solution for new projects by the delivery teams. Sales force will use this data to sell the organizations capabilities. TPS will also align with the sales pipeline outliers, and form a combined Technology positioning for the organization. Imagine the possibilities of leveraging this data.. The Recruitment team can use this data to plan their hiring. They leverage TechSharp to put together the JD’s for the technologies. The LEAD team has a clear technology training focus. Delivery team can plan bench or recommend up skilling of existing str...