Friday, May 23, 2014

Mongo Learning Series 3

Week 3 : MongoDB Schema design


Although we could keep the data in third normal form, MongoDB recommends storing data close to how the application uses it, in an application-driven schema.
Key principles:
1. Rich documents
2. Pre-join / embed data
3. No merge joins
4. No constraints
5. Atomic operations
6. No declared schema

Relational Normalization:
Goals of relational normalization:
1. Free the database of modification anomalies
2. Minimize redesign when extending the database
3. Avoid bias toward any particular access pattern
MongoDB does not pursue the third goal: an application-driven schema is deliberately biased toward the application's access patterns.




Alternate schema for blog


If you are designing your MongoDB schema the same way you would a relational one, you are doing it incorrectly.
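As a sketch of what such an application-driven blog schema might look like, a single pre-joined post document could embed its tags and comments. All field names and values here are illustrative, not taken from the course material:

```python
# A single "rich document" for a blog post: tags and comments are
# embedded (pre-joined) rather than normalized into separate tables.
# Field names and values are illustrative.
post = {
    "_id": 1,
    "title": "MongoDB Schema Design",
    "author": "Prashanth",
    "tags": ["mongodb", "schema"],
    "comments": [
        {"author": "Alice", "text": "Nice post"},
        {"author": "Bob", "text": "Thanks"},
    ],
}

# One read fetches the post together with everything needed to render it.
print(len(post["comments"]))  # 2
```

One fetch of this document retrieves the post, its tags, and its comments, which is exactly the pre-join/embed principle from the list above.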
Living without constraints
MongoDB does not provide a way to enforce foreign key constraints. It is up to the programmers to ensure that when data is stored in multiple documents, the link between them is well maintained.
Embedding usually helps with this, since there is no link to break.
Living without transactions
MongoDB does not support multi-document transactions. However, MongoDB has atomic operations: when you work on a single document, that work will be completed before anyone else sees the document. Readers will see all of your changes or none of them. Since the data is pre-joined, the update is made to one document, instead of initiating a transaction and updating across multiple tables as in a relational database.
3 considerations:
1. Restructure the data so that a change is contained within a single document update
2. Implement transaction-like safeguards in application code rather than at the database layer
3. Tolerate some inconsistency
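The first consideration, restructuring the data so that a change fits in one document, can be sketched with a plain Python dictionary standing in for a document (field names are illustrative):

```python
# Pre-joined layout: the order status and its shipping record live in
# one document. In a normalized layout these would be two tables, and
# changing both consistently would require a transaction.
order = {
    "_id": 42,
    "status": "pending",
    "items": [{"sku": "A1", "qty": 2}],
    "shipping": {"address": "221B Baker St", "shipped": False},
}

# In MongoDB this would be a single update_one with $set on both fields;
# because a single-document write is atomic, readers see either the old
# document or the new one, never a half-updated mix.
order["status"] = "shipped"
order["shipping"]["shipped"] = True
```

The point is the restructuring, not the dictionary mutation: once both pieces of state live in one document, one atomic document write replaces the multi-table transaction.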

One to One relations
One-to-one relations are relations where each item corresponds to exactly one other item.
Example: Employee: Resume
Building: Floor plan
Patient: Medical History
Taking the employee–resume example: you could have an employee document and a resume document and link them by adding the employee ID to the resume document, or the other way around, putting the resume ID in the employee document. Alternatively, you could have one employee document and embed the resume into it, or have a resume document and embed the employee details.
Key considerations are:
1. Frequency of access
Say, for example, that the employee details are accessed constantly but the resume only rarely. If it is a very large collection and you are concerned about locality and working-set size, you may decide to keep them in separate collections, so that you do not pull the resume into memory every single time you pull the employee record.
2. Size of the items
Consider which of the items grows. The employee details might not change much, but the resume does. If an item, especially multimedia, has the potential to grow beyond the 16 MB document limit, you will have to store it separately.
3. Atomicity of data
If you want the employee data and the resume data to stay consistent and be updated at exactly the same time, you will have to embed one in the other, because atomicity is guaranteed only within a single document.
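The two layouts for the employee–resume example can be sketched with plain dictionaries standing in for documents (all field values here are made up for illustration):

```python
# Linked form: two documents in two collections, tied by an id.
employee = {"_id": 1, "name": "Prashanth Panduranga"}
resume = {"_id": 100, "employee_id": 1, "skills": ["python", "mongodb"]}

# Embedded form: one document, read in one round trip and updated
# atomically as a unit.
employee_embedded = {
    "_id": 1,
    "name": "Prashanth Panduranga",
    "resume": {"skills": ["python", "mongodb"]},
}

# In the linked form, fetching the resume for an employee is a second
# query: db.resume.find_one({"employee_id": employee["_id"]}).
print(resume["employee_id"] == employee["_id"])  # True
```

The frequency-of-access, size, and atomicity considerations above decide which of these two shapes to use.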

One to Many relationships
Are relations where one entity maps to many entities.
Example:
City: Person
Take NYC, which has 8 million people.
If we have a city collection with attributes like the name of the city and its area, plus the people in an array, that won't work: there are far too many people to embed in one document.
If we flip that around and have a people collection that embeds the city attributes in each person document, that won't work either: there are a lot of people in a given city, so the city data would be duplicated across every one of them.
The best way to do it is to use linking.

It makes sense to have 2 collections in this case.
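A minimal sketch of that linked layout, with dictionaries standing in for the two collections (values are illustrative):

```python
# City documents are stored once; each person links to a city by id.
cities = {"nyc": {"_id": "nyc", "name": "New York City"}}
people = [
    {"_id": 1, "name": "Ann", "city_id": "nyc"},
    {"_id": 2, "name": "Raj", "city_id": "nyc"},
]

def city_of(person):
    # In real MongoDB this is a second round trip:
    # db.cities.find_one({"_id": person["city_id"]})
    return cities[person["city_id"]]

print(city_of(people[0])["name"])  # New York City
```

Both people link to the same single city document, so the city data is never duplicated.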
One to Few
Example:
Posts: Comments
Although the relation is formally one-to-many, the number of comments might be just a few, so embedding them in the post is fine.

Many to Many
Example:
Books:Authors
Students: Teachers
It might end up being few to few.
It makes the most sense to keep them as separate collections linked by arrays of _ids, unless there are performance issues. Embedding the data is not recommended, because it carries the risk of duplicating data.
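A sketch of that linking for the students–teachers case, with each side reachable through an array of ids and no document duplicated (the names here are invented for illustration):

```python
# Many-to-many via an array of ids: each student lists teacher _ids.
students = [
    {"_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7]},
    {"_id": 1, "name": "Asha", "teachers": [4]},
]
teachers = {
    1: {"_id": 1, "name": "Mr. Rao"},
    4: {"_id": 4, "name": "Ms. Iyer"},
    7: {"_id": 7, "name": "Mr. Das"},
}

def teachers_of(student):
    # In MongoDB: db.teachers.find({"_id": {"$in": student["teachers"]}})
    return [teachers[t]["name"] for t in student["teachers"]]

def students_of(teacher_id):
    # In MongoDB: db.students.find({"teachers": teacher_id})
    return [s["name"] for s in students if teacher_id in s["teachers"]]

print(teachers_of(students[1]))  # ['Ms. Iyer']
print(students_of(4))            # ['Prashanth Panduranga', 'Asha']
```

Each teacher document exists exactly once, however many students reference it.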
Multikey Indexes
When you index a field whose value is an array, you get a multikey index.
Students collection
{ "_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7] }
Where teachers is an array of the teachers
db.students.ensureIndex( { "teachers": 1 } )


A query for all students who have both teachers 1 and 3 returns the matching documents, and its explain plan indicates that the query used the multikey index.
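The matching logic of such a query (db.students.find({"teachers": {"$all": [1, 3]}}) in the shell) can be sketched in plain Python:

```python
# In-memory stand-in for the students collection (values illustrative).
students = [
    {"_id": 0, "name": "Prashanth Panduranga", "teachers": [1, 4, 7]},
    {"_id": 1, "name": "Asha", "teachers": [1, 3]},
]

def find_all(required):
    # Mirrors the $all operator: a document matches only if its array
    # contains every one of the required values. In MongoDB the multikey
    # index on "teachers" can serve this query.
    return [s for s in students if set(required) <= set(s["teachers"])]

print([s["_id"] for s in find_all([1, 3])])  # [1]
```

Only the student whose teachers array contains both 1 and 3 matches.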
Benefits of embedding data
· Improved read performance
This comes from the nature of computer systems: spinning disks have high latency, meaning it takes a long time to get to the first byte; once there, each additional byte comes quickly (high bandwidth). A document stored contiguously can therefore be read in one fast sequential pass.
· One round trip to the DB

Trees
One of the classic problems in the world of schema design is how to represent trees, for example the product catalog of an e-commerce site such as Amazon.
A document in the products collection:
{ product_name: "Snow blower", category: 7 }

A document in the categories collection:
{ _id: 7, category_name: "Outdoors" }

One way to model it is by keeping the parent id in each category:
parent: 6
But this does not make it easy to find all the parents of a category: you would have to query iteratively, finding the parent of each node all the way to the top.
Alternatively, you can list all the children:
children: [1, 2, 5, 6]
This is also fairly limiting if you intend to locate the entire subtree above a certain piece of the tree.
Another alternative is to list all the ancestors, in order:
ancestors: [3, 7, 9, 6]
With this we can easily find all the parent categories of a given category in a single lookup.
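A sketch of ancestor-array queries over an in-memory stand-in for the categories collection (the category names and ids here are illustrative):

```python
# Each category document carries the ordered list of its ancestor ids.
categories = {
    3: {"_id": 3, "name": "Home", "ancestors": []},
    7: {"_id": 7, "name": "Outdoors", "ancestors": [3]},
    9: {"_id": 9, "name": "Winter", "ancestors": [3, 7]},
    6: {"_id": 6, "name": "Snow", "ancestors": [3, 7, 9]},
}

def ancestors_of(cat_id):
    # One lookup, no iterative walk up the tree.
    return [categories[a]["name"] for a in categories[cat_id]["ancestors"]]

def subtree_of(cat_id):
    # In MongoDB: db.categories.find({"ancestors": cat_id}) — a multikey
    # index on "ancestors" makes this a single indexed query.
    return [c["name"] for c in categories.values() if cat_id in c["ancestors"]]

print(ancestors_of(6))  # ['Home', 'Outdoors', 'Winter']
print(subtree_of(7))    # ['Winter', 'Snow']
```

The same array answers both directions: reading it gives every ancestor, and matching against it (a multikey-index query) gives every descendant.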
When to Denormalize
One of the reasons data is normalized is to avoid modification anomalies.
As long as we don't duplicate data, we don't open ourselves to modification anomalies.
1:1 embed – perfectly safe to embed the data, because you are not duplicating anything; rather, what would be separate tables is folded into one document.
1:many – as long as you embed from the many into the one, you still avoid duplicating data.
many:many – link to avoid duplication.

Handling BLOBs
GridFS
If you want to store large files, you are limited by the 16 MB document size. MongoDB has a special facility called GridFS, which breaks a large file into smaller chunks, stores those chunks in one collection, and stores metadata about the file in a second collection.
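A toy sketch of that chunking idea, with a deliberately tiny chunk size (real GridFS drivers commonly default to 255 KiB chunks; the collection and field names below mimic GridFS but the function itself is illustrative):

```python
# Split a byte string into fixed-size chunks, the way GridFS splits a
# file across a chunks collection plus a files (metadata) collection.
CHUNK_SIZE = 4  # tiny on purpose; GridFS uses ~255 KiB

def gridfs_put(data, filename):
    chunks = [
        {"files_id": filename, "n": i // CHUNK_SIZE, "data": data[i:i + CHUNK_SIZE]}
        for i in range(0, len(data), CHUNK_SIZE)
    ]
    file_doc = {"_id": filename, "length": len(data), "chunkSize": CHUNK_SIZE}
    return file_doc, chunks

file_doc, chunks = gridfs_put(b"0123456789", "video.mp4")
print(len(chunks))  # 3
print(b"".join(c["data"] for c in chunks) == b"0123456789")  # True
```

Reassembling the chunks in order of "n" recovers the original file, which is what the driver does on a GridFS read.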



Running the Python file saves the video file into the collection and adds the metadata.




Please note: this is part 3 of a series of 6.

Reference: all the material credit goes to the course hosted by MongoDB.
