Saturday, April 28, 2007

LiveJournal - system architecture

Let's discuss the system architecture of LiveJournal.

LiveJournal, or LJ for short, kicked off as a hobby project in April 1999 and was built entirely on open source. It reached 2.8 million accounts in April 2004 and 6.8 million accounts in April 2005; it currently has more than 10 million accounts. It serves several thousand hits per second and a correspondingly large number of MySQL queries.

Here is a complex diagram which roughly outlines the architecture of LJ.

The technologies visible here are -

Caching - Memcached
MySQL clusters - HA & LB
HTTP load balancing - using Perlbal
MogileFS - Distributed File System

Let's start off with MySQL...

A single MySQL server can't handle a large number of reads and writes; as the load grows, it slows down. The next stage is two servers in a master-slave setup: the master handles all writes, and the slaves are read-only. Queries then have to be spread in such a way that the replication lag between master and slave (though small) doesn't bite - a user who just wrote something must not read stale data from a slave. As the number of database and web servers grows, so does the chaos: the site is fast for a while, then slow again, and more and bigger servers are needed. Worse, every write to the master must be replayed on every slave, so as write volume grows, each slave spends more of its capacity duplicating writes and contributes less read capacity. Eventually writes dominate, resulting in heavy I/O and low CPU utilization.
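To make the read/write split concrete, here is a minimal Python sketch - not LJ's actual code; the connection objects and the lag window are assumptions for illustration. The idea is simply to send writes to the master and to pin a user's reads back to the master for a short window after their own write, so replication lag is never visible to them.

import random
import time

REPLICATION_LAG_WINDOW = 2.0  # seconds; assumed safe upper bound on slave lag

class Router:
    """Route writes to the master and reads to slaves, pinning a user
    to the master briefly after their own write ("read your writes")."""

    def __init__(self, master, slaves):
        self.master = master      # connection to the master DB (assumed API)
        self.slaves = slaves      # list of read-only slave connections
        self.last_write = {}      # user_id -> timestamp of that user's last write

    def execute_write(self, user_id, sql, params=()):
        self.last_write[user_id] = time.time()
        return self.master.execute(sql, params)

    def execute_read(self, user_id, sql, params=()):
        # If this user wrote recently, a slave may not have caught up yet,
        # so serve the read from the master instead.
        if time.time() - self.last_write.get(user_id, 0) < REPLICATION_LAG_WINDOW:
            return self.master.execute(sql, params)
        return random.choice(self.slaves).execute(sql, params)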

The best way to handle such a situation is to partition the database. LJ did this by creating user clusters: each user is assigned a cluster number, and each cluster consists of multiple machines in a master-slave setup. The first query finds the cluster number for that user in the global database, and subsequent queries for that user are redirected to that user's cluster - see the sketch below. Of course, a few issues had to be tackled, such as keeping userids globally unique and moving users between clusters. Caching MySQL connections and using the MySQL query cache to cache query results further improved the site's performance.
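A rough Python sketch of that routing, assuming a global_db connection that stores the user-to-cluster mapping and a clusters dict of per-cluster connections (the table and column names are taken loosely from the talk; the execute API is an assumption):

class ClusterRouter:
    """Map a user to their cluster via the global DB, then run all of
    that user's queries against that cluster's database."""

    def __init__(self, global_db, clusters):
        self.global_db = global_db   # holds the user -> cluster mapping
        self.clusters = clusters     # cluster_id -> DB connection
        self.cache = {}              # memoize the mapping per process

    def cluster_for(self, user_id):
        if user_id not in self.cache:
            row = self.global_db.execute(
                "SELECT clusterid FROM user WHERE userid = %s", (user_id,))
            self.cache[user_id] = row[0]
        return self.clusters[self.cache[user_id]]

    def query(self, user_id, sql, params=()):
        return self.cluster_for(user_id).execute(sql, params)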

Again, there was a single point of failure: the master databases. If any master died, that part of the site went down. To avoid this, master-master pairs were created: if one master has a problem, the other takes over and handles all active connections.

Which storage engine to use - InnoDB or MyISAM? InnoDB uses row-level locking, so it allows concurrent reads and writes and is comparatively fast under a mixed workload, whereas MyISAM takes table-level locks and cannot keep up under concurrent writes.

And then there is MySQL Cluster, which is an in-memory engine. It requires about 2-4x the dataset's size in RAM, so it is suitable only for small data sets.

An even better way of storing the database is to use shared or replicated storage - SAN, shared SCSI, or DRBD. You turn a pair of InnoDB machines into a cluster that looks like a single box from the outside, with a floating IP address. Heartbeat moves the IP, mounts/unmounts the filesystem, and starts/stops MySQL. DRBD syncs one machine's block device with another's, which calls for a dedicated gigabit link between the two machines to carry the replication traffic.

Cache

Memcached is used to cache records that have already been computed and are frequently accessed. Memcached is an open source distributed caching system; instances of it can be run on any machine with free memory. It provides simple APIs for many languages - Java, PHP, Perl, Python, and Ruby - and it is extremely fast.

LJ ran 28 memcached instances on 12 (non-dedicated) machines and was able to cache 30 GB of data, with a hit rate of 90-93%. This cut the number of database queries dramatically. They started by caching the most frequently accessed data, and the aim is to cache almost everything possible. The trade-off is the extra overhead of keeping the cache updated when data changes - see the sketch below.
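A minimal sketch of the pattern in Python, using the python-memcached client (the key scheme and the load/save helpers are assumptions for illustration): reads try the cache first and fall back to the database, and writes invalidate the cached copy so it gets recomputed.

import memcache

# Connect to several memcached instances; keys are hashed across them.
mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

def load_user_from_db(user_id):
    ...  # stand-in for the real database read

def save_user_to_db(user_id, fields):
    ...  # stand-in for the real database write

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)                    # 1. try the cache first
    if user is None:
        user = load_user_from_db(user_id) # 2. miss: hit the database
        mc.set(key, user, time=3600)      # 3. cache it for later reads
    return user

def update_user(user_id, fields):
    save_user_to_db(user_id, fields)
    mc.delete("user:%d" % user_id)        # the write overhead: invalidate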

HTTP load balancing

After trying a large number of reverse proxies, the LJ folks couldn't find one that satisfied their needs, so they built their own: Perlbal, a small, fast, manageable HTTP server that can do internal redirects. It is single-threaded, asynchronous, and event-based; it handles dead nodes; and it works in multiple modes - static web server, reverse proxy, and via plug-ins.

It supports persistent connections and has no complex load-balancing logic: it simply uses whichever backend is free. It connects fast and maintains multiple queues - for example, separate ones for free and paid users.
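For flavor, a minimal Perlbal reverse-proxy configuration looks roughly like this (the addresses are made up, and the exact directives should be checked against Perlbal's own docs; verify_backend is the option that lets Perlbal "use whatever is free" by checking that a backend is ready before sending it a request):

CREATE POOL webservers
  POOL webservers ADD 10.0.0.1:8080
  POOL webservers ADD 10.0.0.2:8080

CREATE SERVICE balancer
  SET listen          = 0.0.0.0:80
  SET role            = reverse_proxy
  SET pool            = webservers
  SET persist_backend = on
  SET verify_backend  = on
ENABLE balancer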


MogileFS - Distributed File System

In MogileFS, files belong to classes, and the system tracks which disks each file lives on, keeping replicas on devices on different hosts. Client libraries are available for most languages - PHP, Perl, Java, and Python.

Clients, trackers, a MySQL database cluster, and storage nodes together make up MogileFS. It handles automatic file replication, deletion, and so on.
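To show how those pieces interact, here is a hypothetical Python sketch of a store-and-fetch cycle - the Tracker class and its methods are invented stand-ins, not a real MogileFS binding. The flow it mirrors is real, though: the client asks a tracker where to put or find a file, then talks HTTP directly to the storage nodes, and the trackers replicate the file per its class policy in the background.

import urllib.request

class Tracker:
    """Hypothetical stand-in for a MogileFS tracker connection."""
    def paths_for_new_file(self, domain, cls, key):
        ...  # ask the tracker for storage-node URLs to upload to
    def paths_for_key(self, domain, key):
        ...  # ask the tracker which storage nodes hold replicas of this key

def store(tracker, domain, cls, key, data):
    url = tracker.paths_for_new_file(domain, cls, key)[0]
    req = urllib.request.Request(url, data=data, method="PUT")
    urllib.request.urlopen(req)          # upload straight to a storage node

def fetch(tracker, domain, key):
    for url in tracker.paths_for_key(domain, key):
        try:
            return urllib.request.urlopen(url).read()
        except OSError:
            continue                     # dead node: try the next replica
    raise KeyError(key)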

Only the major points are covered here; finer details can be found in the link below.


source:
http://danga.com/words/2005_oscon/oscon-2005.pdf
