Week 3: MongoDB Schema Design
Although we could keep the data in third normal form, MongoDB recommends storing data close to the way the application uses it, in an application-driven schema.
Key principles:
1. Rich documents
2. Pre-join / embed data
3. No merge joins
4. No constraints
5. Atomic operations
6. No declared schema
Relational Normalization
Goals of relational normalization:
1. Free the database of modification anomalies
2. Minimize redesign when extending
3. Avoid bias toward any particular access pattern
MongoDB does not consider the 3rd goal in its design.
Alternate schema for blog
If you design the schema the same way you would for a relational database, you are doing it incorrectly.
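As a rough sketch of such an application-driven schema (the database name, field names, and values below are illustrative assumptions, not from the course), a blog post can be stored as one rich document with its comments and tags pre-joined/embedded, using pymongo:

from pymongo import MongoClient

client = MongoClient()      # assumes a local mongod on the default port
db = client.blog            # hypothetical database name

# One rich, pre-joined document per post: comments and tags live inside it,
# so a single read returns everything the page needs.
post = {
    "title": "MongoDB Schema Design",
    "author": "prashanth",
    "tags": ["mongodb", "schema"],
    "comments": [
        {"author": "reader1", "text": "Nice summary"},
        {"author": "reader2", "text": "Thanks for sharing"},
    ],
}
db.posts.insert_one(post)

# The whole post, including its comments, comes back in one round trip.
print(db.posts.find_one({"title": "MongoDB Schema Design"}))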
Living without constraints
MongoDB does not provide a way to enforce foreign-key constraints. It is up to the programmers to ensure that, if the data is stored in multiple documents, the links between them are well maintained. Embedding usually helps with this.
Living without transactions
MongoDB does not support transactions. However, MongoDB has atomic operations. When you work on a single document, that work is completed before anyone else sees the document: they see all the changes you make or none of them. Since the data is pre-joined, the update is made on one document, instead of initiating a transaction and making updates across multiple tables as in a relational database.
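A minimal pymongo sketch of this idea (collection and field names are assumptions, reusing the blog example above): adding a comment to a post with embedded comments is one atomic update on a single document, rather than a transaction spanning several tables.

from pymongo import MongoClient

db = MongoClient().blog     # hypothetical database name

# $push adds the new comment and $inc bumps a counter in ONE atomic operation
# on a single document; readers see either the old document or the fully
# updated one, never a half-applied state.
db.posts.update_one(
    {"title": "MongoDB Schema Design"},
    {
        "$push": {"comments": {"author": "reader3", "text": "Great post"}},
        "$inc": {"comment_count": 1},
    },
)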
3 considerations:
1. Restructure the data so it is contained within a single document update
2. Implement in application code vs. on the database layer
3. Tolerance to inconsistency
One to One relations
One-to-one relations are relations where each item corresponds to exactly one other item.
Examples:
Employee : Resume
Building : Floor plan
Patient : Medical history
Taking the employee/resume example: you could have an employee document and a resume document, which you link by adding the employee ID to the resume document, or the other way round, keeping the resume ID in the employee document. Alternatively, you could have one employee document and embed the resume into it, or have a resume document and embed the employee details.
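A hedged sketch of both options in pymongo (the database, collection, and field names are assumptions chosen for illustration):

from pymongo import MongoClient

db = MongoClient().company    # hypothetical database name

# Option 1: two collections linked by an id.  MongoDB enforces no foreign-key
# constraint here, so the application must keep the link consistent.
db.employees.insert_one({"_id": 42, "name": "Jane Doe"})
db.resumes.insert_one({"employee_id": 42, "education": ["BSc"], "jobs": ["Acme"]})

# Option 2: a single collection with the resume embedded in the employee
# document, so both can be read and updated together.
db.employees_embedded.insert_one({
    "_id": 42,
    "name": "Jane Doe",
    "resume": {"education": ["BSc"], "jobs": ["Acme"]},
})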
Key considerations are:
1. Frequency of access
Let's say, for example, the employee details are constantly accessed but the resume is accessed very rarely. If it is a very large collection and you are concerned about locality and working-set size, you may decide to keep them in separate collections, because you don't want to pull the resume into memory every single time you pull the employee record.
2. Size of the items
Consider which of the items grows. For example, the employee details might not change much, while the resume keeps changing. If there are items, especially multimedia, that have the potential to grow beyond 16 MB, you will have to store them separately.
3. Atomicity of data
If you want to make sure the employee data and the resume data stay consistent, and you want to update both at the same time, then you will have to embed the data to maintain atomicity.
One to Many relationships
One-to-many relations are relations where many entities map to one entity.
Example:
City : Person
Let's say NYC, which has 8 million people.
If we have a city collection with attributes like the name of the city, the area, and the people in an array, that won't work, because there are way too many people.
If we flip that around and have a people collection, embedding the city attributes as part of each person document, that won't work either: there will be a lot of people in a given city, so the same city data ends up duplicated over and over.
The best way to do it is to use linking. It makes sense to have two collections in this case.
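A minimal sketch of the linked approach (collection names and values are illustrative assumptions):

from pymongo import MongoClient

db = MongoClient().test          # hypothetical database name

# Two collections, linked: each person stores only a reference to its city.
db.cities.insert_one({"_id": "NYC", "area": 468, "population": 8000000})
db.people.insert_one({"name": "Alice", "city": "NYC"})
db.people.insert_one({"name": "Bob", "city": "NYC"})

# Everyone in NYC, without the city attributes being duplicated per person.
for person in db.people.find({"city": "NYC"}):
    print(person["name"])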
One to Few
Example:
Posts : Comments
Although the relation is one-to-many, the number of comments might only be a few, so embedding them would be OK.
Many to Many
Example:
Books : Authors
Students : Teachers
In practice it might end up being few-to-few. It makes the most sense to keep them as separate collections, unless there are performance issues. Embedding the data is not recommended, since there would be a risk of duplicating data.
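A sketch of the linked many-to-many layout in pymongo (the teacher ids match the students example below; names and the database name are assumptions):

from pymongo import MongoClient

db = MongoClient().school        # hypothetical database name

# Separate collections; each student carries an array of teacher ids.
# Nothing is duplicated, only the ids are repeated.
db.teachers.insert_one({"_id": 1, "name": "Mr. Smith"})
db.teachers.insert_one({"_id": 4, "name": "Ms. Jones"})
db.teachers.insert_one({"_id": 7, "name": "Dr. Brown"})
db.students.insert_one({"_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7]})

# All students taught by teacher 1:
for student in db.students.find({"teachers": 1}):
    print(student["name"])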
Multi Key Indexes
When you index something that’s an array, you get a multi key index
Students collection:
{ _id: 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7] }
where teachers is an array of the IDs of the teachers.
db.students.ensureIndex( { 'teachers': 1 } )
The ensureIndex call above creates a multikey index on the teachers array. A query for all students who have both teachers 1 and 3 can then use it, and the explain plan indicates that the query used the index.
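In pymongo, the same index and the query it supports could look roughly like this (create_index is the pymongo counterpart of the shell's ensureIndex; the database name is an assumption carried over from the sketch above):

from pymongo import MongoClient

db = MongoClient().school        # hypothetical database name, as above

# Indexing an array field gives a multikey index: one index entry per
# array element.
db.students.create_index([("teachers", 1)])

# Students who have BOTH teacher 1 and teacher 3.
cursor = db.students.find({"teachers": {"$all": [1, 3]}})

# explain() shows the winning plan; an index scan stage on "teachers"
# indicates the multikey index was used.
print(cursor.explain())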
Benefits of embedding data:
· Improved read performance: spinning disks have high latency, meaning it takes a long time to get to the first byte; once they reach the first byte, each additional byte comes quickly (high bandwidth). Embedded data is stored and read together, so it takes advantage of this.
· One round trip to the DB
Trees
One of the classic problems in the world of schema design is how to represent trees, for example the product catalog in an e-commerce site such as Amazon.
Products – products collection:
Category: 7
Product_name: "Snow blower"
Category – category collection:
_id: 7
Category_name: "Outdoors"
One way to model it is by keeping the parent id on each category:
Parent: 6
But this doesn't make it easy to find all the parents of a category: you have to query iteratively, finding the parent of each category, all the way to the top.
Alternatively, you can list all the children:
Children: [1, 2, 5, 6]
which is also fairly limiting if you intend to locate the entire subtree above a certain piece of the tree.
Alternate:
Ancestors: [3, 7, 9, 6]
List all the ancestors in order; with this we can easily find all the parent categories of a given category.
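A small sketch of how an ancestors array can be queried in pymongo (the database name, ids, and category names are illustrative assumptions):

from pymongo import MongoClient

db = MongoClient().store     # hypothetical database name

# Each category carries the ordered list of its ancestors.
db.categories.insert_one(
    {"_id": 7, "Category_name": "Outdoors", "Ancestors": [6, 9, 3]}
)

# All parent categories of category 7, in order, from a single lookup:
print(db.categories.find_one({"_id": 7})["Ancestors"])

# The same field answers the reverse question: every category anywhere
# beneath category 6 is simply every document with 6 among its ancestors.
for cat in db.categories.find({"Ancestors": 6}):
    print(cat["Category_name"])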
When to Denormalize
One of the reasons data is normalized is to avoid modification anomalies. As long as we don't duplicate data, we don't open ourselves up to modification anomalies.
1:1 embed – perfectly safe to embed the data, because you are not opening yourself up to modification anomalies; you are not duplicating data, rather you are folding what would be separate tables into one document.
1:Many – as long as you are embedding the many into the one, you still avoid duplicating data.
Many:Many – link to avoid duplication.
Handling Blobs
GridFS
If you want to store large files, you are limited by the 16 MB document size limit. MongoDB has a special facility called GridFS, which breaks a large file up into smaller chunks, stores those chunks in one collection, and stores metadata about the file in a second collection.
Running the Python file saves the video file into the collection and adds the metadata.
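The Python file itself is not reproduced in these notes; a minimal pymongo/gridfs sketch of the same idea (the file path and database name are placeholders) might look like:

from pymongo import MongoClient
import gridfs

db = MongoClient().videos           # hypothetical database name
fs = gridfs.GridFS(db)              # stores chunks in fs.chunks, metadata in fs.files

# Write a large file into GridFS: it is split into chunks behind the scenes,
# and a metadata document (filename, length, upload date) is created for it.
with open("video.mp4", "rb") as f:  # "video.mp4" is a placeholder path
    file_id = fs.put(f, filename="video.mp4")

# Read the file back later by its id.
data = fs.get(file_id).read()
print(len(data), "bytes read back")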
Please note: This is part of a series of 6.
Reference: All the material credit goes to the course hosted by MongoDB.