Friday, May 23, 2014

Mongo Learning Series 6

Week 6: Application Engineering


Mongo Application Engineering
1.       Durability of Writes
2.       Availability / Fault Tolerance
3.       Scaling

WriteConcern


Traditionally, when we insert/update records the operation is performed as fire-and-forget. The Mongo shell, however, wants to know whether the operation succeeded, and hence calls getLastError after every write.
There are a couple of arguments for getLastError with which the operations can be performed:
w:1  --  wait for a write acknowledgement. Still not durable: it returns true once the changes were made in memory, not necessarily after they are written to disk. If the system fails before writing to disk, the data will be lost.
j:1  --  journal. Returns an acknowledgement only once the write is in the journal on disk, which is guaranteed: the operation can be replayed if lost.
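A rough sketch of this in the shell (a throwaway foo collection is assumed; drivers normally set these options on the connection or per write instead):

db.foo.insert({x: 1});                           // on its own, fire-and-forget
db.runCommand({getLastError: 1, w: 1});          // acknowledge the write (in memory only)
db.runCommand({getLastError: 1, w: 1, j: true}); // acknowledge only after the journal hits disk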


api.mongodb.org



Network Errors

Although w:1, j:1 is set, there are other factors which might leave the outcome unknown. Let's say you did an insert over a connection with w:1, j:1, and the driver issues a getLastError. The write did complete, but unfortunately the network connection got reset before the acknowledgement came back. In that case you will not know whether the write completed, because you never received the acknowledgement.

Replication:
ReplicaSets: Replica sets are a set of mongo nodes that act together and mirror each other: one primary and multiple secondaries. Data written to the primary is asynchronously replicated to the secondaries. Which node is primary is decided dynamically. The application and its drivers always connect to the primary. If the primary goes down, the secondaries hold an election to decide which one becomes the new primary, and that requires a strict majority of the nodes.
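As a sketch, a three-node set can be initiated from the shell like this (the set name rs1 and the host names are illustrative):

config = {
    _id: "rs1",
    members: [
        {_id: 0, host: "host1:27017"},
        {_id: 1, host: "host2:27017"},
        {_id: 2, host: "host3:27017"}
    ]
};
rs.initiate(config);   // run once on one member; the nodes then elect a primary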



The minimum number of nodes to form a replica set is 3.
Types of replica set nodes:
1.       Regular
2.       Arbiter (voting only)
3.       Delayed / Regular (disaster recovery node – it cannot be a primary node)
4.       Hidden (often used for analytics, cannot be a primary node)
MongoDB does not offer eventual consistency by default; it offers strong consistency: in the default configuration you write to and read from the primary. If reads are moved to the secondaries, reads become eventually consistent and there might be some discrepancies.
Failover usually takes about 3 seconds.
rs.slaveOk() – allow reads from a secondary in the shell
rs.isMaster() – shows whether this node is the primary and lists the set members
seedlist – the list of nodes the driver is given to discover the replica set
rs.stepDown() – make the current primary step down and force an election
w:'majority' – write concern that waits until a majority of nodes have the write
rs.status() – state of the replica set
rs.conf() – configuration of the replica set
rs.help() – list of replica set commands

Read Preference: the default is to read from the primary, but when you have a lot of nodes and you want reads to go to the secondaries as well, you set the read preference. Read preferences are set on the drivers (Pymongo has 4; other drivers offer others).
List of Read preferences allowed:
1.       Primary
2.       Secondary
3.       Primary Preferred
4.       Secondary preferred
5.       Nearest
6.       Tagged
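For example, in the shell, reads can be permitted on secondaries and routed per cursor (the zips collection and the query are illustrative; drivers expose their own equivalents):

rs.slaveOk();                                               // permit reads from secondaries in this shell
db.zips.find({state: "NY"}).readPref("secondaryPreferred"); // prefer a secondary, fall back to the primary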

Sharding


There can be more than one mongos.
The shards can be arranged range-based.
The data is identified by the shard key.


Shard help
sh.help()

Implications of sharding on development
1.       Every document includes the shard key
2.       The shard key is immutable – it cannot be changed, so it needs to be chosen carefully
3.       There must be an index that starts with the shard key
4.       When you do an update, the shard key has to be specified, or multi has to be set to true (see the sketch after this list)
a.       With multi, the update is going to be sent to all of the nodes
5.       No shard key on a query means the request is sent to all nodes => scatter gather
6.       No unique key unless it is part of the shard key
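A sketch of the setup and of rule 4; the school database, students collection and student_id shard key are illustrative:

use school
sh.enableSharding("school")                                // allow collections in this DB to be sharded
db.students.ensureIndex({student_id: 1})                   // index that starts with the shard key
sh.shardCollection("school.students", {student_id: 1})     // range-based sharding on student_id
db.students.update({student_id: 128}, {$set: {grade: "A"}})             // targets one shard via the shard key
db.students.update({grade: "F"}, {$set: {flagged: true}}, false, true)  // no shard key: multi:true, scatter-gather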

Choosing a shard key
1.       Sufficient cardinality
2.       Avoid hot spotting: a monotonically increasing key (such as _id) sends all inserts to one shard
Import
mongoimport --db dbName --collection collectionName --file fileName.json
doc=db.thinks.findOne();
for (key in doc) print(key);




Week 7: Case Studies


Jon Hoffman from Foursquare

Scala, MongoDB
5 million check-ins a day
Over 2.5 billion check-ins in total
AWS is used for the application servers
The database is hosted on their own racks, SSD based
They migrated the database off AWS due to performance issues at the time; AWS has since addressed those with its SSD offerings

Ryan Bubinski from Codecademy

Ruby for server side
Javascript for client side and some server side
API in Ruby
App layer in Ruby and Javascript
All client side is javascript
Mongoid ODM (Object document mapper)
Rails for application layer
Rack api
nginx 
10Gen MMS
Cookie-based session storage
Redis session store (in-memory key-value session store)
Millions of submissions
The submissions vary from hundreds of kilobytes to megabytes
1st gen O(1 million) – order of magnitude of 1 million
                Hosted service
2nd Gen O(10 million)
EC2
Quad extra large (4x large) memory instances
EBS
Provisioned IOPS
Replica sets
Single primary
2 secondary
Writes to primary
Reads from secondary
To scale the read load horizontally while one machine handles the write load
Sharded temporarily:
2 shards with replica sets
                3rd gen O(100+ million)
                                S3-backed answer storage
                                Used S3 as a key-value store
                writeConcern
                                For all writes which involve a confirmation or user acknowledgement, use safe mode
                                For logging and other event-based writes, disable safe mode
                Rsync for replication
                Heroku
                The application layer and API layer, which handle both reads and writes, are hosted on Heroku
                Heroku is AWS backed
                Both Codecademy and Heroku (AWS) are hosted in the same availability zone

Please note: this is part of a series of 6.
Reference: all the material credit goes to the course hosted by MongoDB.
Now it feels good to have the course certificate:


Mongo Learning Series 5

Week 5: Aggregation Framework


The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines. Using this framework, MongoDB passes the documents of a single collection through a pipeline.
Let’s say there is a table:

Name  | Category   | Manufacturer | Price
iPad  | Tablet     | Apple        | 499
S4    | Cell Phone | Samsung      | 350

If I wanted to find out how many products come from each manufacturer, the way it is done in SQL is through a query:
Select manufacturer, count(*) from products group by manufacturer
We need to use the Mongo aggregation framework to do something similar to “group by”:
use agg
db.products.aggregate([ {$group: { _id: "$manufacturer", num_products: {$sum: 1} }} ])

Aggregation pipeline

Aggregation uses a pipeline in MongoDB. The concept of pipes is similar to Unix. At the top is the collection. The documents are piped through the processing pipeline: they go through a series of stages and eventually produce a result set. Each of the stages can appear multiple times.




Unwind denormalizes the data: for an array, the unwind command creates a separate document for each element of the array, with all the other data repeated in each document, thus creating redundant data.
In the above diagram:
1:1 maps to the same number of documents
N:1 maps to a subset of the documents being returned
1:N represents a larger set of documents returned, due to the unwind operation

Simple aggregation example expanded


If the above aggregation query is run against the products collection, it goes through each document and looks at the manufacturer; if a group for that manufacturer doesn't exist yet, it creates one, and it adds 1 to the num_products value.


At the end of the iteration, a list of all the unique manufacturers and their respective number of products is produced as the result set.

Compound grouping
For compound grouping, where traditionally we would use queries such as
Select manufacturer, category, count(*) from products group by manufacturer, category
the example below groups by manufacturer and category:
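As a sketch (same products collection as above; note that _id here is a document built from both keys):

use agg
db.products.aggregate([
    {$group:
     {
         _id: {"manufacturer": "$manufacturer", "category": "$category"},
         num_products: {$sum: 1}
     }
    }
])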


 
Using a document for _id
_id doesn't always have to be a number or a string; the important aspect is that it has to be unique. It can also be a document, as in the compound grouping example above.


Aggregate Expressions
The following are the different aggregation expressions:
1.       $sum – count and sum up the key
2.       $avg – average
3.       $min – minimum value of the key
4.       $max – maximum value of the key
5.       $push – build arrays
6.       $addToSet – build arrays, adding each value only once
7.       $first – after sorting, produces the first document in each group
8.       $last – after sorting, produces the last document in each group

Using $sum
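For instance, against the zips collection used later in these notes, $sum can total the population per state (a sketch; field names follow that dataset):

db.zips.aggregate([
    {$group: {_id: "$state", population: {$sum: "$pop"}}}  // add up pop over every zip in the state
])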




Using $avg
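Similarly, $avg averages instead of totaling (same zips collection assumed):

db.zips.aggregate([
    {$group: {_id: "$state", average_pop: {$avg: "$pop"}}}  // average zip-code population per state
])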


Using $addToSet
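A sketch with the zips collection, collecting each city's zip codes without duplicates (in that dataset _id is the zip code):

db.zips.aggregate([
    {$group: {_id: "$city", postal_codes: {$addToSet: "$_id"}}}
])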

Using $push
The difference between $push and $addToSet is that $push doesn't check for duplicates: it just adds the value. $addToSet adds only after checking for duplicates.
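The same query with $push (duplicates, if any, are kept):

db.zips.aggregate([
    {$group: {_id: "$city", postal_codes: {$push: "$_id"}}}
])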


Using $max and $min
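A sketch finding the largest and smallest zip-code population per state (zips collection assumed):

db.zips.aggregate([
    {$group: {_id: "$state",
              biggest: {$max: "$pop"},
              smallest: {$min: "$pop"}}}
])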

Double Grouping
You can run more than one $group stage in the same pipeline.

Example:
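A sketch of a double grouping, assuming a flat grades collection with class_id, student_id and score keys: the first $group averages per student within each class, the second averages those averages per class.

db.grades.aggregate([
    {$group: {_id: {class_id: "$class_id", student_id: "$student_id"},
              average: {$avg: "$score"}}},     // per-student average within each class
    {$group: {_id: "$_id.class_id",
              average: {$avg: "$average"}}}    // class average of the student averages
])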


Using $project

$project reshapes each document: keys can be removed, added, renamed, or computed from existing values. It passes documents through 1:1.
Project example
use agg
db.products.aggregate([
    {$project:
     {
         _id:0,
         'maker': {$toLower:"$manufacturer"},
         'details': {'category': "$category",
                    'price' : {"$multiply":["$price",10]}
                   },
         'item':'$name'
     }
    }
])

use agg
db.zips.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

Using $match

$match filters documents, letting through only those that match the criteria. Matching as early as possible in the pipeline allows an index to be used.
use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
         zip_codes:1
     }
    }
     
])

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
         zip_codes: {$addToSet: "$_id"}
     }
    }
])


Using $sort

Sort happens in memory and hence can hog memory.
If the sort comes before grouping and after a match, it can use an index.
If the sort comes after grouping, it cannot use an index.

use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    }
])
 

$limit and $skip

Pipeline stages run in the order given, so skip and limit only make sense after a sort; below we sort by population, skip the first 10 cities, and take the next 5.
use agg
db.zips.aggregate([
    {$match:
     {
         state:"NY"
     }
    },
    {$group:
     {
         _id: "$city",
         population: {$sum:"$pop"},
     }
    },
    {$project:
     {
         _id: 0,
         city: "$_id",
         population: 1,
     }
    },
    {$sort:
     {
         population:-1
     }
    },
    {$skip: 10},
    {$limit: 5}
])



Using $unwind


db.posts.aggregate([
    /* unwind by tags */
    {"$unwind":"$tags"},
    /* now group by tags, counting each tag */
    {"$group": 
     {"_id":"$tags",
      "count":{$sum:1}
     }
    },
    /* sort by popularity */
    {"$sort":{"count":-1}},
    /* show me the top 10 */
    {"$limit": 10},
    /* change the name of _id to be tag */
    {"$project":
     {_id:0,
      'tag':'$_id',
      'count' : 1
     }
    }
    ])



db.posts.aggregate([{"$unwind":"$comments"},{$group:{"_id":{"author":"$comments.author"},count:{"$sum":1}
}},
{$sort:
     {
                 count:-1
     }
    }
{$limit: 1}
])
Some examples:

Average class score (homework 5.3):
db.grades.aggregate([
    {$unwind: '$scores'},
    {$match: {'scores.type': {$in: ['exam', 'homework']}}},
    {$group:
     {_id: {"studentId": '$student_id', "classId": "$class_id"},
      Avgscore: {$avg: '$scores.score'}
     }
    },
    {$group:
     {_id: "$_id.classId",
      Avgclassscore: {"$avg": "$Avgscore"}
     }
    },
    {$sort:
     {Avgclassscore: -1}
    }
])

SQL to Aggregation Mapping
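The mapping chart from the course is not reproduced here; roughly, it is the following (a condensed summary, see the MongoDB docs for the full chart):

SQL term          | Aggregation framework
------------------+----------------------------------------
SELECT            | $project
WHERE             | $match
GROUP BY          | $group
HAVING            | $match (after a $group)
ORDER BY          | $sort
LIMIT             | $limit
SUM(), COUNT(*)   | $sum
join              | no direct equivalent ($unwind / denormalization)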



Limitations of the aggregation framework

1.       The result set is limited to 16MB (the maximum document size)
2.       You cannot use more than 10% of the memory on a machine
3.       Sharding: aggregation works in a sharded environment, but after the first $group or $sort phase the results have to be brought back to the mongos



Alternatives to the aggregation framework
1.       MapReduce
2.       Hadoop



Please note: this is part of a series of 6.
Reference: all the material credit goes to the course hosted by MongoDB.

Mongo Learning Series 4

Week 4: Performance


Indexes

Database performance is driven by indexes, for MongoDB as for any other database.
Databases store the data in large files on disk, which represent the collections. There is no particular order for the documents on disk; a document could be anywhere. When you query for a particular document, what the database has to do by default is scan through the entire collection to find it. This is called a table scan in a relational database and a collection scan in MongoDB, and it is death to performance: it is extremely slow. Instead, the data is indexed to perform better.
How does indexing work?
If something is ordered/sorted, then it is quick to find the data, so MongoDB keeps the keys ordered.
MongoDB does not keep the keys linearly ordered, but uses a BTree. When looking for an item, it looks for the key in the index, which has a pointer to the document, and thus retrieves the document.
In MongoDB, indexes are ordered lists of keys.
Example:
(name, Hair_Color, DOB)


In order to utilize an index, you have to give it a leftmost set of its items.
As in, provide: name,
or name and Hair_Color,
rather than just DOB.
Every time data is inserted into the database, the index also needs to be updated, and updating takes time. Reads are faster, but writes take longer when you have an index.
Let's say we have an index on (a,b,c):
If a query is done on b, the index cannot be used
If a query is done on a, the index can be used
If a query is done on c, the index cannot be used
If a query is done on a,b: the index can be used; it uses 2 parts of the index
If a query is done on a,c: the index can be used, but it uses just the a part and ignores the c part

Creating Indexes
db.students.ensureIndex({student_id:1})
db.students.ensureIndex({student_id:1, class:-1}) – compound index
Negative indicates descending. Ascending vs descending doesn't make a big difference when you are searching, but makes a huge difference when you are sorting: if the database is to use the index for the sort, it needs to be in the right order.
You can also make it a 3-part index.
Discovering Indexes
db.system.indexes.find() – will give all the indexes in the database
db.students.getIndexes() – will give all the indexes on the given collection
db.students.dropIndex({student_id:1}) – will delete/drop the index

MultiKey Indexes
In MongoDB a key can hold an array:
tags: ["cycling", "tennis", "football"]
ensureIndex({tags:1})
When you index a key which is an array, a multikey index is created.
Rather than creating one index point per document, when MongoDB sees an array while building the index, it creates an index point for every item in the array.
MongoDB also lets you create a compound index involving an array, but it restricts two keys of the same index from both being arrays: a compound index on two arrays is not allowed.


Indexes are not restricted to the top level alone.

An index can be created on sub-areas of the document as well.
For example:
db.people.ensureIndex({'addresses.tag': 1})
db.people.ensureIndex({'addresses.phones': 1})
Index creation option: unique
A unique index enforces the constraint that each key can appear only once in the index.
db.stuff.ensureIndex({'thing': 1}, {unique: true})


Removing duplicates when creating unique indexes:
db.stuff.ensureIndex({'thing': 1}, {unique: true, dropDups: true})
Adding dropDups will delete all duplicates. There is no control over which of the duplicate documents gets deleted, hence it is important to exercise caution before using this command.
Index creation option: sparse
Used when an index is created on a collection where more than one document is missing the indexed key:
{a:1, b:1, c:1}
{a:2, b:2}
{a:3, b:3}

If a unique index is created on c:
The first document has c in it, which is fine. For the second document, Mongo considers c to be null, and the third document also has no c, hence null again. Since c is null in two documents and unique is specified, the index cannot be created.
In scenarios where the duplicates cannot be dropped, a sparse index solves this problem: documents missing the key are simply left out of the index.


Querying documents in a collection with a sparse index will not change the result set.
However, sorting by a sparse key on such a collection produces a result set which leaves out the documents missing the sparse index keys.
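A sketch for the documents above (reusing the stuff collection name from the unique-index example):

db.stuff.ensureIndex({c: 1}, {unique: true, sparse: true})  // docs without c stay out of the index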

Indexes can be created in the foreground or in the background. Default: foreground.
When an index is created in the foreground it blocks all writers.
Foreground index builds are faster.
Creating an index with the background:true option is slower, but does not block writers.
In production systems where there are other writers to the database and no replica sets are in use, creating indexes as background tasks is mandatory so that the other writers are not blocked.
Using Explain
Important query metrics, such as the index usage pattern, execution speed, and number of scanned documents, can be identified by using the explain command.
Explain details:
{
  "cursor" : "",
  "isMultiKey" : ,
  "n" : ,
  "nscannedObjects" : ,
  "nscanned" : ,
  "nscannedObjectsAllPlans" : ,
  "nscannedAllPlans" : ,
  "scanAndOrder" : ,
  "indexOnly" : ,
  "nYields" : ,
  "nChunkSkips" : ,
  "millis" : ,
  "indexBounds" : { },
  "allPlans" : [
                 { "cursor" : "",
                   "n" : ,
                   "nscannedObjects" : ,
                   "nscanned" : ,
                   "indexBounds" : { }
                 },
                  ...
               ],
  "oldPlan" : {
                "cursor" : "",
                "indexBounds" : { }
              },
  "server" : "",
  "filterSet" :
}
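For example, against the zips collection used in the aggregation section:

db.zips.find({state: "NY"}).explain()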


Choosing an Index
How does MongoDB choose an index?
Let's say the collection has indexes on a, b and c.
We will call those query plan 1 for a, 2 for b, and 3 for c.
When we run a query for the first time, Mongo runs all three query plans 1, 2 and 3 in parallel.
Let's say query plan 2 was the fastest and completed first: Mongo returns the answer to the query and memorizes that it should use that index for similar queries. Every 100 or so queries it forgets what it knows and reruns the experiment to see which plan performs best.

How large is your index?
The index should be in memory. If the index is not in memory but on disk, and we are using all of it, performance will be impacted severely.


The .totalIndexSize() command (e.g. db.students.totalIndexSize()) gives the size of a collection's indexes.







Index Cardinality
Cardinality is a measure of the number of elements of a set: how many index points there are for each different type of index that MongoDB supports.


In a regular index, every single key you put in the index gets an index point, and in addition, if a document is missing the key, there is an index point under the null entry; so the cardinality is 1:1 relative to the documents.
In a sparse index, when a document is missing the key being indexed, it is not in the index, because nulls are not kept in a sparse index. So the index cardinality will be less than or equal to the number of documents.
In a multikey index (an index on an array value), there are multiple index points for each document, and hence the cardinality can be greater than the number of documents.

Index Selectivity
Being selective with indexes is very important, which is no different from an RDBMS.
Consider an example of logging with operation codes (opcodes) such as Save, Open, Run, Put, Get.
You could have an index on (timestamp, opcode) or the reverse, (opcode, timestamp).
If you know the particular time window you are interested in, then (timestamp, opcode) makes the most sense, while the reverse could first match millions of records for a given operation.

Hinting an Index
Generally, MongoDB uses its own algorithm to choose an index; however, if you want to tell MongoDB to use a particular index, you can do so by using the hint command:
hint({a:1, b:1})
If you want MongoDB not to use an index, and instead use a cursor that goes through all the documents in the collection, you can use $natural:
hint({$natural:1})
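For example, in the shell (a compound index on a and b, and a stuff collection, are assumed):

db.stuff.find({a: 100, c: 5}).hint({a: 1, b: 1})  // force the (a,b) index
db.stuff.find({a: 100}).hint({$natural: 1})       // force a full collection scan instead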

Hinting in Pymongo example

Efficiency of Index Use
Searching on regexes like /abcd/ that are not anchored to the beginning of the string, and comparison operators such as $gt, $ne etc., are very inefficient even with indexes.
In such cases, based on your knowledge of the collection, you can hint at the appropriate index rather than relying on the one Mongo picks by default.
Geospatial indexes
They allow you to find things based on location.
2D and spherical (3D).
2D: cartesian plane (x and y coordinates)

Say you want to know the closest stores to a person.
In order to search based on location, you need to store the location:
'location': [x, y]
then index the locations:
ensureIndex({'location': '2d', type: 1})
While querying, you can then use:
find({location: {$near: [x, y]}}).limit(20)
The database will return the documents in order of increasing distance.

Geospatial Spherical
Spherical geospatial indexes consider the curvature of the earth.
In the database, the order of the coordinates is longitude, then latitude.
db.runCommand({geoNear: 'stores', near: [50, 50], spherical: true, maxDistance: 1})
Here stores is the collection.
It is queried with the runCommand instead of the find command.

Logging slow queries
MongoDB automatically logs slow queries, those taking more than 100 ms.



Profiling
The profiler writes entries/documents to system.profile for operations slower than a specified time.
There are three levels for the profiler: 0, 1 and 2.
0, the default, means off
1 logs slow running queries
2 logs all queries – more for debugging than for performance tuning

db.system.profile.find().pretty()
db.getProfilingLevel()
db.getProfilingStatus()
db.setProfilingLevel(1,4)
1 sets it to log slow running queries, and 4 sets the threshold to 4 milliseconds

Write the query to look in the system profile collection for all queries that took longer than one second, ordered by timestamp descending.
db.system.profile.find({millis:{$gt:1000}}).sort({ts:-1})






Mongostat
Mongostat is named after iostat from the Unix world and is similar to perfmon in Windows.


Mongotop
Named after the Unix top command, it provides a high-level view of where Mongo is spending its time.




Sharding
Sharding is the technique of splitting up a large collection amongst multiple servers.



A mongos lets you shard.
The way Mongo shards is that you choose a shard key; let's say student_id is the shard key.
As a developer you need to know that, for inserts, you will also need to send the shard key (the entire shard key if it is multi-part) in order for the insert to complete.
For an update, a remove or a find, if mongos is not given the shard key, it has to broadcast the request to all the shards. If you know the shard key, passing it will increase the performance of the queries.
mongos is usually co-located with the application, and you can have more than one mongos.

How to get all the keys of a document
var message = db.messages.findOne();
for (var key in message) {
   print(key);
}




Please note: this is part of a series of 6.
Reference: all the material credit goes to the course hosted by MongoDB.