Skip to main content

Mongo Learning Series 5

Week 5: Aggregation Framework


The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline
Let’s say there is a table
Name
Category
Manufacturer
Price
iPad
Tablet
Apple
499
S4
Cell Phone
Samsung
350

If I wanted to find out how many products from each manufacturer from each manufacturer, the way it is done in SQL is through a query :
Select manufacturer, count(*) from products group by manufacturer
We need to use Mongo aggregation framework to use similar to “group by“
use agg
db.products.aggregate([ {$group: { _id:”$manufacturer”,num_products:{$sum:1} }}])

Aggregation pipeline

 Aggregation uses a pipeline in MongoDB.  The concept of pipes is similar to unix. At the top is the collections. The documents are piped through the processing pipeline and they go through series of stages and will eventually get a result set.  Each of the stage can happen multiple times.




Unwind denormalizes the data. For an array the command unwind will create a separate document for each key in the array with all other data being repeated in the document, thus creating redundant data.
In the above diagram
1:1 maps to same number of records
N:1 maps to only a subset of records returned
1:N represents a larger set of records returns due to unwind operation

Simple aggregation example expanded


If the above aggregation query, is run against a product collection, it 
goes through each record looks for the manufacturer, if doesn’t exist, creates a record and adds the num_products value.


At the end of the iteration, a list of all the unique manufacturers and their respective number of products will be produced as a result set

Compound grouping
For compound grouping where traditionally we use queries such as
Select manufacturer, category, count(*) from products group by manufacturer, category
The below example groups by manufacturer and category


 
Using a document for _id
_id doesn’t always have to be a number or a string, the important aspect is that is has to be unique. It can also be a document.


Aggregate Expressions
The following are the different aggregation expressions
1.       $sum – count and sum up the key
2.       $avg - average
3.       $min – minimum value of the key
4.       $max – maximum value
5.       $push – build arrays
6.       $addToSet – add to set only adds uniquely
7.       $first – after sorting the document produces the first document
8.       $last – after sorting the document produces the last document

Using $sum




Using  $avg


Using addToSet

Using $push
Difference between push and addToSet is that push doesn’t check for duplicates and it just adds the same. . addToSet adds by checking for duplicates


Using Max and min

Double Grouping
You can run more than one aggregation statement

Example:


Using $project


Project example
use agg
db.products.aggregate([
    {$project:
     {
         _id:0,
         'maker': {$toLower:"$manufacturer"},
         'details': {'category': "$category",
                    'price' : {"$multiply":["$price",10]}
                   },
         'item':'$name'
     }
    }
])

use agg
db.zips.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

Using $match


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
         zip_codes:1
     }
    }
     
])

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    }
])


Using $sort

Sort happens in memory and hence can hog memory
If the sort is before grouping and after match, it can use index
If the sort is after grouping it cannot use index

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    }
      
    
     
])
 

$limit and $skip


use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])



Using $unwind


db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
    ])



db.posts.aggregate([{"$unwind":"$comments"},{$group:{"_id":{"author":"$comments.author"},count:{"$sum":1}
}},
{$sort:
     {
                 count:-1
     }
    }
{$limit: 1}
])
Some examples:

Avg score homework 5.3
db.grades.aggregate([ {$unwind:'$scores'},{$match:{'scores.type':{$in:['exam','homework']}}},{$group:{_id:
{"studentId":'$student_id',"classId":"$class_id"},Avgscore:{$avg:'$scores.score'}}},{$group:
{_id:"$_id.classId","Avgclassscore":{"$avg":"$Avgscore"}}},
{$sort:
     {
                 Avgclassscore:-1
     }
    }

])

SQL to Aggregation Mapping



Limitations to aggregation framework

1.       The result set is limited to 16MB of memory
2.       You cannot use more than 10% of memory on a machine
3.       Sharding: Aggregation does work on a sharded environment, but after the first $group or $sort phase, the aggregation has to be brought back to the MongoS



Alternates of aggregation framework
1.       MapReduce
2.       Hadoop



Please Note : This is a series of 6 
Reference: All the material credit goes to the course hosted by Mongo

Comments

Popular posts from this blog

Diversification Story Airline 4: AirAsia, From Budget Airline to Digital Lifestyle Ecosystem

  In 2001, Tony Fernandes bought a struggling, debt-ridden Malaysian airline for just one ringgit (about 25 cents)   along with its $11 million debt. Within a year, the low-cost model was flying, and AirAsia soon became Asia’s best-known budget airline. But the true story of AirAsia is not just about democratizing flying — it’s about how a small airline diversified boldly into adjacent and transformational businesses, reinventing itself as a digital lifestyle brand. Horizon 1: The Core – Low-Cost Flying AirAsia’s foundation was its low-cost, no-frills passenger business. Inspired by Southwest and Ryanair, built the airline on simple principles: A single aircraft type for efficiency (Airbus A320). Quick turnarounds to maximize utilization. Aggressive pricing to stimulate demand. “Now Everyone Can Fly” — a brand promise that resonated across Southeast Asia. From Malaysia, AirAsia expanded regionally, launching subsidiaries in Thailand, Indo...

Diversification Story Airline 10: Pan Am, The Cautionary Tale of Diversification and Decline

Few airlines inspire as much nostalgia as Pan American World Airways. Founded in 1927, Pan Am was once the world’s most glamorous and innovative airline — the “chosen instrument” of U.S. international aviation. It pioneered transoceanic flying, introduced the jumbo jet era, and set the standard for luxury in the skies. But Pan Am is also one of the most famous failures in diversification, a story of ambition that outpaced strategy, and expansion that collapsed under its own weight. Horizon 1: The Core – America’s Flag Carrier to the World Pan Am’s foundation was international passenger flights. In 1927, it operated its first mail and passenger flight from Key West, Florida, to Havana, Cuba. By the 1930s, Pan Am pioneered flying boats (Clippers) that connected the Americas to Europe and Asia. In the 1950s–60s, it became the world’s premier international airline, with routes to every continent. In 1970, Pan Am was the launch customer of ...

Mongo Learning Series 1

Mongo Learning First of all, I want to thank and congratulate the MongoDB team for hosting such a wonderful introductory interactive course.  Good job guys. For those interested here is the url https://education.mongodb.com/ It is a 7 week course. The syllabus follows: Week 1: Introduction Introduction & Overview - Overview, Design Goals, the Mongo Shell, JSON Intro, installing tools, overview of blog project. Bottle, Pymongo Week 2: CRUD CRUD (Creating, Reading and Updating Data) - Mongo shell, query operators, update operators and a few commands Week 3: Schema Design Schema Design - Patterns, case studies and tradeoffs Week 4: Performance Using indexes, monitoring and understanding performance. Performance in sharded environments. Week 5: Aggregation Framework Goals, the use of the pipeline...