Tuesday, March 26, 2013

how to clean up a huge mongodb collection?

As most mongodb users know, mongodb leans heavily on RAM. The more RAM you give the DB server, the happier mongodb is. But once the data/index size exceeds the available RAM, you see increasing response times for all your queries.

Recently we had an issue where the db size exceeded the RAM we had on our machine. Suddenly the query response times increased to 10-20 times their original values. Luckily we had a cleanup strategy in place, but we had never had the chance to execute it.

We were dealing with around 110 million entries and expected that the cleanup would remove around 50% of them. The problem was our setup.

We had multiple slaves in our replica set, so running a simple delete query on the master would replicate the deletes to the slaves as well. What we wanted to do was remove all entries older than, say, "n" days. For example, 6 months. The delete query for this would be

db.test1.remove( { ts : { $lt : ISODate("2012-09-27T00:00:00.000Z")  } } )
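The hardcoded date in the query is just "6 months before today". As a small sketch (the helper name is ours, not from the post, and 6 months is approximated as 180 days), the cutoff can be computed in plain JavaScript, which also runs in the mongo shell:

```javascript
// cutoffDate is a hypothetical helper: returns the Date `daysOld` days
// before `now` (`now` defaults to the current time).
function cutoffDate(daysOld, now) {
  now = now || new Date();
  return new Date(now.getTime() - daysOld * 24 * 60 * 60 * 1000);
}

// In the mongo shell the remove could then be written as (not run here):
// db.test1.remove({ ts: { $lt: cutoffDate(180) } });
```

180 days before this post's date (2013-03-26) is exactly 2012-09-27, which matches the ISODate in the query above.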

This fires 1 query on the master, but for each record deleted on the master, a separate delete entry is written to the oplog, which then replicates to the slaves. So if this query is run on the master and we intend to remove 50 million of our existing 110 million entries, we would end up with 50 million entries in the oplog. Which is a lot of IO.
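One way to soften that burst (our own sketch, not something we ran) is to delete in bounded batches, so the oplog and IO load is spread out over time instead of arriving all at once. `batchedRemove` and its parameters are hypothetical names; `coll` stands for a mongo shell collection:

```javascript
// Hypothetical sketch: delete matching docs in batches of `batchSize`
// instead of one giant remove, bounding the oplog written per iteration.
function batchedRemove(coll, cutoff, batchSize) {
  var removed = 0;
  while (true) {
    // Fetch up to batchSize _ids of documents older than the cutoff...
    var ids = coll.find({ ts: { $lt: cutoff } }, { _id: 1 })
                  .limit(batchSize)
                  .map(function (d) { return d._id; });
    if (ids.length === 0) {
      return removed; // nothing older than the cutoff is left
    }
    // ...and remove exactly those documents.
    coll.remove({ _id: { $in: ids } });
    removed += ids.length;
  }
}
```

A sleep between iterations could throttle the load further; whether this is fast enough at 50 million entries is exactly the kind of thing we would have had to measure.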

Another solution that crossed our mind was to bypass the oplog by creating a standalone instance of mongodb and running our delete query there. This should have worked in theory. But even with the oplog out of the picture, the deletions were terribly slow. After firing the query and waiting for around 3 hours, we knew this would not work.

With that plan aborted, another small beam of light came through. Remember mysql and how we used to move data across tables?

insert into table2 select * from table1 where ...

We tried replicating this statement in mongo and were successful.

db.col1.find( { ts : { $gt : ISODate("2012-09-27T00:00:00.000Z")  } } ).forEach( function(c){db.col2.insert(c)} )

This query took approximately 30 minutes to execute, and we had a new collection col2 ready with only the last 6 months of data. Now all we needed to do was rename the collections. We preferred swapping over dropping, so the existing data stayed backed up in case something went wrong.
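The swap itself can be done with the shell's renameCollection. A sketch, using the col1/col2 names from above ("col1_backup" is a name we picked, not from the post); it needs a live mongod, so it is shown here as a function rather than run:

```javascript
// Sketch of the swap, assuming the mongo shell.
function swapCollections(db) {
  // Keep the old (full) collection around under a backup name.
  db.col1.renameCollection("col1_backup");
  // Promote the freshly copied collection to the live name.
  db.col2.renameCollection("col1");
}
```

Once the application has been verified against the swapped collection, col1_backup can be dropped to reclaim the space.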


In order to keep the data trimmed automatically going forward, we converted the collection to a ttl collection.

db.test1.ensureIndex( { "ts" : 1 }, { expireAfterSeconds : 15552000 } )

So any entry older than 6 months (= 15552000 seconds) will be automatically deleted.


manguesh-borker said...


I don't know much about the actual implementation details or challenges at your end, nor would I like to discount the solution that worked for you, but just as an afterthought on your first 2 approaches... I think the first 2 approaches would have worked if you had run them in a multithreaded environment; you probably could have taken help of load testing tools like jmeter to simulate N number of threads to delete the rows... It looks like since you were running a single thread and trying to delete the rows, the IO was substantial enough to cause long delays... I could be wrong too.... :)
