Friday, May 23, 2014

Mongo Learning Series 3

Week 3 : MongoDB Schema design


Although we could keep the data in third normal form, MongoDB recommends storing data close to how the application uses it, in an application-driven schema.
Key principles:
1. Rich documents
2. Pre-join / embed data
3. No merge joins
4. No constraints
5. Atomic operations
6. No declared schema

Relational Normalization:
Goals of relational normalization:
1. Free the database of modification anomalies
2. Minimize redesign when extending the database
3. Avoid bias toward any particular access pattern
MongoDB does not pursue the third goal: an application-driven schema is deliberately biased toward the application's access patterns.




Alternate schema for blog


If you are designing your MongoDB schema the same way you would a relational one, you are doing it incorrectly.
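As a sketch of what such an application-driven blog schema might look like, a single pre-joined post document could embed its tags and comments. All field names and values here are illustrative, not taken from the course material:

```python
# A single "rich document" for a blog post: tags and comments are
# embedded (pre-joined) rather than normalized into separate tables.
# Field names and values are illustrative.
post = {
    "_id": 1,
    "title": "MongoDB Schema Design",
    "author": "Prashanth",
    "tags": ["mongodb", "schema"],
    "comments": [
        {"author": "Alice", "text": "Nice post"},
        {"author": "Bob", "text": "Thanks"},
    ],
}

# One read fetches the post together with everything needed to render it.
print(len(post["comments"]))  # 2
```

One fetch of this document retrieves the post, its tags, and its comments, which is exactly the pre-join/embed principle from the list above.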
Living without constraints
MongoDB does not provide a way to enforce foreign key constraints. It is up to the programmers to ensure that when data is stored in multiple documents, the link between them is well maintained.
Embedding usually helps with this, since there is no link to break.
Living without transactions
MongoDB does not support multi-document transactions. However, MongoDB has atomic operations: when you work on a single document, that work will be completed before anyone else sees the document. Readers will see all of your changes or none of them. Since the data is pre-joined, the update is made to one document, instead of initiating a transaction and updating across multiple tables as in a relational database.
3 considerations:
1. Restructure the data so that a change is contained within a single document update
2. Implement transaction-like safeguards in application code rather than at the database layer
3. Tolerate some inconsistency
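The first consideration, restructuring the data so that a change fits in one document, can be sketched with a plain Python dictionary standing in for a document (field names are illustrative):

```python
# Pre-joined layout: the order status and its shipping record live in
# one document. In a normalized layout these would be two tables, and
# changing both consistently would require a transaction.
order = {
    "_id": 42,
    "status": "pending",
    "items": [{"sku": "A1", "qty": 2}],
    "shipping": {"address": "221B Baker St", "shipped": False},
}

# In MongoDB this would be a single update_one with $set on both fields;
# because a single-document write is atomic, readers see either the old
# document or the new one, never a half-updated mix.
order["status"] = "shipped"
order["shipping"]["shipped"] = True
```

The point is the restructuring, not the dictionary mutation: once both pieces of state live in one document, one atomic document write replaces the multi-table transaction.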

One to One relations
One-to-one relations are relations where each item corresponds to exactly one other item.
Example: Employee: Resume
Building: Floor plan
Patient: Medical History
Taking the employee–resume example: you could have an employee document and a resume document and link them by adding the employee ID to the resume document, or the other way around, putting the resume ID in the employee document. Alternatively, you could have one employee document and embed the resume into it, or have a resume document and embed the employee details.
Key considerations are:
1. Frequency of access
Say, for example, that the employee details are accessed constantly but the resume only rarely. If it is a very large collection and you are concerned about locality and working-set size, you may decide to keep them in separate collections, so that you do not pull the resume into memory every single time you pull the employee record.
2. Size of the items
Consider which of the items grows. The employee details might not change much, but the resume does. If an item, especially multimedia, has the potential to grow beyond the 16 MB document limit, you will have to store it separately.
3. Atomicity of data
If you want the employee data and the resume data to stay consistent and be updated at exactly the same time, you will have to embed one in the other, because atomicity is guaranteed only within a single document.
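The two layouts for the employee–resume example can be sketched with plain dictionaries standing in for documents (all field values here are made up for illustration):

```python
# Linked form: two documents in two collections, tied by an id.
employee = {"_id": 1, "name": "Prashanth Panduranga"}
resume = {"_id": 100, "employee_id": 1, "skills": ["python", "mongodb"]}

# Embedded form: one document, read in one round trip and updated
# atomically as a unit.
employee_embedded = {
    "_id": 1,
    "name": "Prashanth Panduranga",
    "resume": {"skills": ["python", "mongodb"]},
}

# In the linked form, fetching the resume for an employee is a second
# query: db.resume.find_one({"employee_id": employee["_id"]}).
print(resume["employee_id"] == employee["_id"])  # True
```

The frequency-of-access, size, and atomicity considerations above decide which of these two shapes to use.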

One to Many relationships
Are relations where one entity maps to many entities.
Example:
City: Person
Take NYC, which has 8 million people.
If we have a city collection with attributes like the name of the city and its area, plus the people in an array, that won't work: there are far too many people to embed in one document.
If we flip that around and have a people collection that embeds the city attributes in each person document, that won't work either: there are a lot of people in a given city, so the city data would be duplicated across every one of them.
The best way to do it is to use linking.

It makes sense to have 2 collections in this case.
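A minimal sketch of that linked layout, with dictionaries standing in for the two collections (values are illustrative):

```python
# City documents are stored once; each person links to a city by id.
cities = {"nyc": {"_id": "nyc", "name": "New York City"}}
people = [
    {"_id": 1, "name": "Ann", "city_id": "nyc"},
    {"_id": 2, "name": "Raj", "city_id": "nyc"},
]

def city_of(person):
    # In real MongoDB this is a second round trip:
    # db.cities.find_one({"_id": person["city_id"]})
    return cities[person["city_id"]]

print(city_of(people[0])["name"])  # New York City
```

Both people link to the same single city document, so the city data is never duplicated.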
One to Few
Example:
Posts: Comments
Although the relation is formally one-to-many, the number of comments might be just a few, so embedding them in the post is fine.

Many to Many
Example:
Books:Authors
Students: Teachers
It might end up being few to few.
It makes the most sense to keep them as separate collections linked by arrays of _ids, unless there are performance issues. Embedding the data is not recommended, because it carries the risk of duplicating data.
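A sketch of that linking for the students–teachers case, with each side reachable through an array of ids and no document duplicated (the names here are invented for illustration):

```python
# Many-to-many via an array of ids: each student lists teacher _ids.
students = [
    {"_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7]},
    {"_id": 1, "name": "Asha", "teachers": [4]},
]
teachers = {
    1: {"_id": 1, "name": "Mr. Rao"},
    4: {"_id": 4, "name": "Ms. Iyer"},
    7: {"_id": 7, "name": "Mr. Das"},
}

def teachers_of(student):
    # In MongoDB: db.teachers.find({"_id": {"$in": student["teachers"]}})
    return [teachers[t]["name"] for t in student["teachers"]]

def students_of(teacher_id):
    # In MongoDB: db.students.find({"teachers": teacher_id})
    return [s["name"] for s in students if teacher_id in s["teachers"]]

print(teachers_of(students[1]))  # ['Ms. Iyer']
print(students_of(4))            # ['Prashanth Panduranga', 'Asha']
```

Each teacher document exists exactly once, however many students reference it.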
Multikey Indexes
When you index a field whose value is an array, you get a multikey index.
Students collection
{ "_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7] }
Where teachers is an array of the teachers
db.students.ensureIndex( { "teachers": 1 } )


A query for all students who have both teachers 1 and 3 returns the matching documents, and its explain plan indicates that the query used the multikey index.
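The matching logic of such a query (db.students.find({"teachers": {"$all": [1, 3]}}) in the shell) can be sketched in plain Python:

```python
# In-memory stand-in for the students collection (values illustrative).
students = [
    {"_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7]},
    {"_id": 1, "name": "Asha", "teachers": [1, 3]},
]

def find_all(required):
    # Mirrors the $all operator: a document matches only if its array
    # contains every one of the required values. In MongoDB the multikey
    # index on "teachers" can serve this query.
    return [s for s in students if set(required) <= set(s["teachers"])]

print([s["_id"] for s in find_all([1, 3])])  # [1]
```

Only the student whose teachers array contains both 1 and 3 matches.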
Benefits of embedding data
· Improved read performance
This comes from the nature of computer systems: spinning disks have high latency, meaning it takes a long time to get to the first byte; once there, each additional byte comes quickly (high bandwidth). A document stored contiguously can therefore be read in one fast sequential pass.
· One round trip to the DB

Trees
One of the classic problems in the world of schema design is how to represent trees, for example the product catalog of an e-commerce site such as Amazon.
A document in the products collection:
{ product_name: "Snow blower", category: 7 }

A document in the categories collection:
{ _id: 7, category_name: "Outdoors" }

One way to model it is by keeping the parent id in each category:
parent: 6
But this does not make it easy to find all the parents of a category: you would have to query iteratively, finding the parent of each node all the way to the top.
Alternatively, you can list all the children:
children: [1, 2, 5, 6]
This is also fairly limiting if you intend to locate the entire subtree above a certain piece of the tree.
Another alternative is to list all the ancestors, in order:
ancestors: [3, 7, 9, 6]
With this we can easily find all the parent categories of a given category in a single lookup.
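A sketch of ancestor-array queries over an in-memory stand-in for the categories collection (the category names and ids here are illustrative):

```python
# Each category document carries the ordered list of its ancestor ids.
categories = {
    3: {"_id": 3, "name": "Home", "ancestors": []},
    7: {"_id": 7, "name": "Outdoors", "ancestors": [3]},
    9: {"_id": 9, "name": "Winter", "ancestors": [3, 7]},
    6: {"_id": 6, "name": "Snow", "ancestors": [3, 7, 9]},
}

def ancestors_of(cat_id):
    # One lookup, no iterative walk up the tree.
    return [categories[a]["name"] for a in categories[cat_id]["ancestors"]]

def subtree_of(cat_id):
    # In MongoDB: db.categories.find({"ancestors": cat_id}) — a multikey
    # index on "ancestors" makes this a single indexed query.
    return [c["name"] for c in categories.values() if cat_id in c["ancestors"]]

print(ancestors_of(6))  # ['Home', 'Outdoors', 'Winter']
print(subtree_of(7))    # ['Winter', 'Snow']
```

The same array answers both directions: reading it gives every ancestor, and matching against it (a multikey-index query) gives every descendant.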
When to Denormalize
One of the reasons data is normalized is to avoid modification anomalies.
As long as we don't duplicate data, we don't open ourselves to modification anomalies.
1:1 embed – perfectly safe to embed the data, because you are not duplicating anything; rather, what would be separate tables is folded into one document.
1:many – as long as you embed from the many into the one, you still avoid duplicating data.
many:many – link to avoid duplication.

Handling BLOBs
GridFS
If you want to store large files, you are limited by the 16 MB document size. MongoDB has a special facility called GridFS, which breaks a large file into smaller chunks, stores those chunks in one collection, and stores metadata about the file in a second collection.
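A toy sketch of that chunking idea, with a deliberately tiny chunk size (real GridFS drivers commonly default to 255 KiB chunks; the collection and field names below mimic GridFS but the function itself is illustrative):

```python
# Split a byte string into fixed-size chunks, the way GridFS splits a
# file across a chunks collection plus a files (metadata) collection.
CHUNK_SIZE = 4  # tiny on purpose; GridFS uses ~255 KiB

def gridfs_put(data, filename):
    chunks = [
        {"files_id": filename, "n": i // CHUNK_SIZE, "data": data[i:i + CHUNK_SIZE]}
        for i in range(0, len(data), CHUNK_SIZE)
    ]
    file_doc = {"_id": filename, "length": len(data), "chunkSize": CHUNK_SIZE}
    return file_doc, chunks

file_doc, chunks = gridfs_put(b"0123456789", "video.mp4")
print(len(chunks))  # 3
print(b"".join(c["data"] for c in chunks) == b"0123456789")  # True
```

Reassembling the chunks in order of "n" recovers the original file, which is what the driver does on a GridFS read.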



Running the Python file saves the video file into the collection and adds the metadata.




Please note: this is part 3 of a series of 6.

Reference: all the material credit goes to the course hosted by MongoDB.
