Tuesday, March 26, 2013

how to clean up a huge mongodb collection?

As most mongodb users know, mongodb relies heavily on RAM: it memory-maps its data files, so performance depends on the working set fitting in memory. The more RAM you give the DB server, the happier mongodb is. But once the data/index size exceeds the available RAM, you see increasing response times for all your queries.

Recently we had an issue where the db size exceeded the RAM on our machine. Suddenly we saw query response times increase to 10-20 times their original values. Luckily we had a cleanup strategy in place, but we had never gotten the chance to execute it.

We were dealing with around 110 million entries and expected that the cleanup would remove around 50% of them. The problem was our setup.

We had multiple slaves in our replica set, so running a simple delete query on the master would replicate the deletes to the slaves as well. What we wanted was to remove all entries older than, say, "n" days, for example 6 months. The delete query for this would be:

db.test1.remove( { ts : { $lt : ISODate("2012-09-27T00:00:00.000Z")  } } )

This fires 1 query on the master, but every record deleted on the master gets its own delete entry written to the oplog, which then replicates to the slaves. So if this query runs on the master and removes 50 million of our existing 110 million entries, we end up with 50 million entries in the oplog. Which is a lot of IO.

Another solution that crossed our minds was to bypass the oplog by bringing up a standalone instance of mongodb and running our delete query there. This should have worked in theory. But even with the oplog disabled, the deletions were terribly slow. After firing the query and waiting for around 3 hours, we knew this would not work.

With that plan aborted, another small beam of light came through: remember mysql, and how we used to move data across tables?

Insert into table2 select * from table1 where ...

We tried replicating this statement in mongo and were successful.

db.col1.find( { ts : { $gt : ISODate("2012-09-27T00:00:00.000Z")  } } ).forEach( function(c){db.col2.insert(c)} )

This query took approximately 30 minutes to execute, and we had a new collection col2 ready with only the data newer than the 6-month cutoff. Now all we needed to do was rename the collections. We preferred swapping over dropping, so the existing data stayed around as a backup in case something went wrong.
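The swap itself can be done with renameCollection; the names below follow the col1/col2 examples above, and "col1_old" is just a hypothetical backup name:

```javascript
// Keep the original collection around as a backup instead of dropping it.
db.col1.renameCollection("col1_old")

// Promote the trimmed copy to the original name, so existing code keeps working.
db.col2.renameCollection("col1")
```

If anything goes wrong, renaming col1_old back restores the original data.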


In order to keep the collection from growing out of bounds again, we converted it to a ttl collection.

db.test1.ensureIndex( { "ts" : 1 }, { expireAfterSeconds : 15552000 } )

So any entry older than 6 months (15552000 seconds) will be automatically deleted.
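The 15552000 figure is just 6 months (taken as 180 days) expressed in seconds; a quick sanity check in plain JavaScript, runnable outside the mongo shell:

```javascript
// 6 months approximated as 180 days, matching the expireAfterSeconds value above.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86400
const expireAfterSeconds = 180 * SECONDS_PER_DAY;
console.log(expireAfterSeconds); // 15552000
```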

Saturday, March 02, 2013


I have been a fan of G-Shock watches for quite some time now, but this was my first experience of owning one. After much dilemma over whether to get one, I went ahead and bought the Mudman 9300.

Features :
moon data
dual time
5 alarms with snooze
stop watch
count down timer
power saving feature
solar powered
battery level indicator
water resistant to 200 meters
shock resistant
world clock
hourly chime

And really good looking. Worth the money spent...

Microsoft Licences

Recently I got the opportunity to be a part of the windows team. We are (yes, still are) using a microsoft (yes, the same microsoft) product to run one of our websites due to legacy bindings - user base, existing technology, backend team.

My first encounter with microsoft on the enterprise end was when we were trying to use Microsoft Navision - a supply chain management solution - in one of my previous companies. I say "trying" because it took us more than 6-8 months to put it into production, and another 3 months in training. Microsoft sucks the user in. I saw that if I purchase 1 product from microsoft, the dependencies are so well built in that I eventually end up purchasing a lot of other microsoft products.

Microsoft NAV cost us around 1 million INR. Now I cannot use NAV as it is; it needs to be customized. And it cannot be customized by just any developer - only by companies / developers who hold the licence to do so. That customization licence is extremely expensive - maybe even more expensive than a licence for selling liquor in india. Once I pay for customization, I have to deploy the software somewhere, for which I need more microsoft licences - OS, web server, database server. And then of course plan for HA (high availability) - which means at least 2 of each. So the strategy here is that once you purchase a product licence, you need the complete platform licence, and you eventually end up paying many times the actual product cost.

Another concept that I became aware of recently was "software assurance". What is that? Well, have you heard of life insurance? Software Assurance (SA) is somewhat similar. It ensures that you get all the patches and version upgrades (which may or may not be free) as and when they are released. So if you purchase windows 2012 and plan to shift to windows 2014 when it is released, it is possible, though there may be some cost involved.

Among all microsoft licences, I believe the DB licence is the killer. The standard licence costs 1/4th of the enterprise licence. The difference between the two is that standard can utilize only up to 2 cores in a machine, while enterprise can utilize any number of cores - with the licence cost charged in multiples of "dual cores". So if you have a dual quad-core machine (8 cores), you end up purchasing 4 enterprise licences, which is 16 times the cost of a standard licence.
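The 16x claim follows from the numbers in this post; the figures below are illustrative units, assuming one enterprise dual-core licence costs 4x one standard licence, not actual price-list values:

```javascript
// Illustrative cost model - not actual microsoft price-list figures.
const standardCost = 1;                         // cost of one standard licence, our unit
const enterprisePerDualCore = 4 * standardCost; // enterprise dual-core licence = 4x standard
const cores = 8;                                // dual quad-core machine
const licencesNeeded = cores / 2;               // enterprise is licensed per pair of cores -> 4
const totalEnterpriseCost = licencesNeeded * enterprisePerDualCore;
console.log(totalEnterpriseCost / standardCost); // 16
```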

And why should I pay microsoft when there are so many technologies which are better and available free of cost? If I have to pay for support, why should I pay for the product and then for the support? Why not get the product for free and pay only for support?

My final assessment was that microsoft is like a spider's web: once you get entangled, you keep getting more and more entangled, and there is no getting out without losing your own investment. Beware!!!