Friday, May 28, 2010

Quick sort algorithm

Quicksort follows the simple strategy of dividing the data into parts and sorting it.

Simple steps for the algorithm are:
  • Pick an element, called a pivot, from the list.
  • Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
  • Recursively sort the sub-list of lesser elements and the sub-list of greater elements.

Time complexity of the quick sort algorithm is O(n log n). And the worst case time complexity is O(n^2). Quicksort has a space complexity of O(log n).

Lets see a java program to do quick sort

public class QuickSort {

    public int[] data;
    public QuickSort(int[] data)
    {
        this.data = data;
    }

    public void sortMe()
    {
        this.sortMe(data, 0, data.length-1);
    }

    public void sortMe(int[] array, int start, int end)
    {
        int i=start;
        int k=end;

        if((end - start) < 1) //return if there is < 1 element
            return;

        int pivot = array[start]; //take an integer as pivot
        while(k > i) //while the left and right indices have not met
        {
            //from left to right find first element greater than pivot
            while(array[i] <= pivot && i <= end && k > i) 
                i++;
            //from right to left find first element smaller than pivot
            while(array[k] > pivot && k >= start && k >= i)
                k--;
            //if left index is still smaller than right, swap the elements
            if(k > i)
                swap(array, i, k);
        }
        //swap the last element in the left partition with the pivot.
        swap(array, start, k);

        //quicksort left partition
        sortMe(array, start, k-1);
        //quicksort right partition
        sortMe(array, k+1, end);
    }

    public void swap(int[] array, int idx1, int idx2)
    {
        int temp=array[idx1];
        array[idx1] = array[idx2];
        array[idx2] = temp;
    }

    public void printMe(int[] data)
    {
        for(int x=0; x<data.length; x++)
        {
            System.out.print(data[x]+", ");
        }
        System.out.println();
    }

    public static void main(String[] args)
    {
        int[] data = {9,2,6,4,11,88,66,44,55,32,23,5,32,10};
        QuickSort qs = new QuickSort(data);
        System.out.print("Before Sorting : ");
        qs.printMe(data);
        qs.sortMe();        
        System.out.print("After Sorting : ");
        qs.printMe(qs.data);
    }

}

Wednesday, May 19, 2010

Setting up replication on postgresql9.0 beta

Eventually postgresql comes out with its inbuilt replication solution in 9.0 beta. Setting up of replication is quite simple. We will look at setting up a simple master-slave replication between two servers of postgresql.

The way replication works in postgres is that the master keeps on writing "write ahead log" files (also known as wal files). The slave can be configured to run either in recovery mode or hot standby. The difference being that in recovery mode the wal files require to be shipped using something like a cron. And the slave applies the wal files to its database. The problem here is that it is not smooth and the delay between master and slave can be big. The logic of hot standby is that the slave connects to the master and fetches the log files from master. And it applies those log files on its database. Here the lag between master and slave is not huge since the slave tries to keep up with the master. This is also known as streaming replication because the wal files are continuously shipped to the slave and applied there. There could be multiple slaves fetching wal files from master and applying to their own database. As opposed to mysql the slave here is readonly - it cannot be modified. It can be used only as read-only slave for select queries.

Lets look at deploying master-slave replication between two servers

Master server : 172.16.3.173
Slave server : 172.16.3.211

Step 1:
Install postgresql 9.0 on both servers. Simple steps for installation are
$ ./configure
$ make
$ make install
$ adduser postgres

Step 2:
Initialize the data directory on master. No need to initialize it on slave - since we will be copying the master directory to slave for starting replication.
$ mkdir /usr/local/pgsql/data
$ chown -R postgres.postgres /usr/local/pgsql/data
$ su - postgres
$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data

In case you are trying to create a slave of existing postgres database, you can simply skip these steps and install postgres on a slave machine - and not initialize the data directory on the slave machine.

Step 3:
Change config on master so that any slave is authorized connect to the master. Add the following line to pg_hba.conf of master.
$ vim data/pg_hba.conf
host    replication     postgres        172.16.3.211/22         trust
This says that 172.16.3.211 (the slave) is allowed to connect to master for replication without any password. Configure this for all slaves (if you plan to have multiple slaves).

Step 4:
Change parameters on the master postgresql for streaming replication.
$ vim data/postgresql.conf
# allow all ip addresses to connect to the master.
listen_addresses = '*'

# set wal_level for hot standby replication.
wal_level = hot_standby

# enable archiving to an archive directory accessible from the standby. Not required for streaming replication. But set it for "just in case" scenarios 
archive_mode = on
archive_command = 'cp "%p" /path/to/archive_directory/"%f"'

# set the number of concurrent slave connections
max_wal_senders = 5

# set the minimum no of WAL segments to be retained in the pg_xlog directory from where the slave can pick up the WAL segments.
# This prevents removal of WAL segments before they are shipped the slave.
wal_keep_segments = 32

Step 5:
Start postgres on primary/master server.

Step 6:
Create a backup of master and ship it to the slave server.
$ ./bin/psql -c "SELECT pg_start_backup('postgres',true)"
$ rsync -a /usr/local/pgsql/data postgres@172.16.3.211:/usr/local/pgsql/data
$ ./bin/psql -c "SELECT pg_stop_backup()"

Step 7:
Clean the data directory in slave and set replication parameters.
$ rm /usr/local/pgsql/data/postmaster.pid
Here the postgresql.conf is same as master, so just set the parameters required by slave.
$ vim /usr/local/pgsql/data/postgresql.conf
# disable archiving
archive_mode = off
# enable read-only queries on slave - during replication/recovery
hot_standby = on


Step 8:
Setup the parameters in recovery.conf required for streaming replication
$ vim recovery.conf
# shell command for copying log files back from the archive. Should not be required.
restore_command = 'cp -i /home/postgres/archive/%f "%p" </dev/null'
# enable standby
standby_mode = 'on'
# connection information for connecting to master
primary_conninfo = 'host=172.16.3.173 port=5432 user=postgres'
# stop streaming and enable the server to work in read/write mode, if this file is found.
# server keeps on polling for the trigger file and stops replication once this file is found. The server can then be used in read/write mode.
trigger_file = '/usr/local/pgsql9.0/data/stop.replication'



Step 9:
Start postgres on slave. It will start streaming replication between master and slave.
To check replication see if sender process is active on master
$ ps -ef | grep sender
postgres 12332 12158  0 10:13 ?        00:00:00 postgres: wal sender process postgres 172.16.3.211(41497) streaming 0/90149F0
postgres 16233  9474  0 11:36 pts/2    00:00:00 grep sender
On slave check if receiver process is working
$ ps -ef | grep receiver
postgres 27952 27945  0 10:12 ?        00:00:00 postgres: wal receiver process   streaming 0/9014974        
postgres 30354 23492  0 11:36 pts/0    00:00:00 grep receiver

That is it. Run a few DML/DDL queries on master and check if they are being replicated on slave. The lag between master and slave is not noticable.