setting up hadoop

I have updated this post with the latest hadoop config changes, compatible with hadoop 1.1.2.

Hadoop provides a distributed file system (HDFS) similar to the google file system, and uses map-reduce to process large amounts of data on a large number of nodes. I will give a brief step-by-step process to set up hadoop on single and multiple nodes.

First, let’s go with a single node:

  • Download hadoop.tar.gz from hadoop.apache.org.
  • You can set up hadoop to run as any user, but it is preferable to create a separate user for running hadoop. On Debian/Ubuntu:
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hadoop
  • Untar the hadoop.tar.gz file in the home directory of the hadoop user
    [hadoop@linuxbox ~]$ tar -xvzf hadoop.tar.gz
  • Check the version of java. It should be at least java 1.5, preferably java 1.6
    $ java -version
    java version "1.6.0"
    Java(TM) SE Runtime Environment (build 1.6.0-b105)
    Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
  • Hadoop needs to ssh to the local server, so you need to create keys on the local machine so that ssh does not require a password.
    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    fb:7a:cf:c5:c0:ec:30:a7:f9:eb:f0:a4:8b:da:6f:88 hadoop@linuxbox
    Now append the public key to the authorized_keys file, so that ssh does not ask for a password
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Now check that ssh logs in without a password
    $ ssh localhost
    Last login: Sat Oct 18 18:30:57 2008 from localhost
    $
  • Change environment parameters in hadoop-env.sh
    export JAVA_HOME=/path/to/jdk_home_dir
  • Change configuration parameters in hadoop-site.xml. 

In hadoop 1.x, hadoop-site.xml has been replaced by 3 files: core-site.xml, hdfs-site.xml and mapred-site.xml. Each file has its own <configuration> element.

in core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

in hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>

in mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>

  • Format the name node
    $ cd /home/hadoop
    $ ./bin/hadoop namenode -format
    Check the output for errors.
  • Start the single node cluster
    $ <HADOOP_INSTALL_DIR>/bin/start-all.sh
    This should start the namenode, datanode, jobtracker and tasktracker, all on one machine.
  • Check whether the daemons are up and running. The output should look roughly like this
    $ jps
    28982 JobTracker
    28737 DataNode
    28615 NameNode
    30570 Jps
    29109 TaskTracker
    28870 SecondaryNameNode
  • In case of any error, check the log files in the <HADOOP_INSTALL_DIR>/logs directory.
  • To stop the node, run
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh
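The jps check above can be scripted. The following is only a minimal sketch: the function name missing_daemons is made up for this post, and it assumes jps prints one "<pid> <name>" pair per line, as in the output shown above.

```shell
#!/bin/sh
# missing_daemons (hypothetical helper, not part of hadoop):
# reads jps-style lines ("<pid> <name>") on stdin and prints every
# expected daemon name (given as arguments) that is NOT present.
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        # Compare against the whole second field, so that "NameNode"
        # does not accidentally match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Usage on the single node setup (prints nothing when all daemons are up):
#   jps | missing_daemons NameNode DataNode SecondaryNameNode JobTracker TaskTracker
```

If the function prints any name, that daemon's log file in the logs directory is the first place to look.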

We will skip running actual map-reduce tasks on the single node setup and go ahead with a multi-node setup. Once we have 2 machines up and running, we will run some example map-reduce tasks on those nodes. So, let’s proceed with the multi-node setup.

For the multi-node setup, you should have 2 machines up and running, both with the single node hadoop setup on them. We will refer to the machines as master and slave, and assume that hadoop has been installed under the /home/hadoop directory on both nodes.

  • First, stop the single node hadoop running on them.
    $ <HADOOP_INSTALL_DIR>/bin/stop-all.sh
  • Edit the /etc/hosts file on both servers to set up the master and slave names. E.g.:
    aaa.bbb.ccc.ddd master
    www.xxx.yyy.zzz slave
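  • Now, the master should be able to ssh to the slave without a password, so append the master’s public key to the slave’s authorized_keys file. Copy it under a different name so that it does not overwrite the slave’s own id_rsa.pub. (Skip ssh-keygen if the master already has a key from the single node setup.)
    master]$ ssh-keygen -t rsa
    master ~/.ssh]$ scp id_rsa.pub hadoop@slave:.ssh/master_id_rsa.pub
    slave ~/.ssh]$ cat master_id_rsa.pub >> authorized_keys
    Test the ssh setup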
    master]$ ssh master
    master]$ ssh slave
    slave ]$ ssh slave
  • On the master server, edit the <HADOOP_INSTALL_DIR>/conf/masters and <HADOOP_INSTALL_DIR>/conf/slaves files to add the master and slave hosts. The files should look like this.

    master ~/hadoop/conf]$ cat masters
    master

    The slaves file contains the hosts (one per line) where the hadoop slave daemons (datanodes and tasktrackers) run. In our case we are running the datanode and tasktracker on both machines. In addition, the master server also runs the master services (namenode and jobtracker). Both master and slave store data.

    master ~/hadoop/conf]$ cat slaves
    master
    slave

  • Now change the configuration on all machines (master & slave). In hadoop 1.x the settings below go in core-site.xml, hdfs-site.xml and mapred-site.xml under <HADOOP_INSTALL_DIR>/conf (older versions used a single hadoop-site.xml). Set/change the following variables.

    Specify the host and port of the name node (master server), in core-site.xml.
    fs.default.name = hdfs://master:54310

    Specify the host and port of the job tracker (map reduce master), in mapred-site.xml.
    mapred.job.tracker = master:54311

    Specify the number of datanodes each block of a file is replicated to, in hdfs-site.xml. It should not exceed the number of datanodes. In our case it is 2 (master & slave both act as datanodes).
    dfs.replication = 2

  • You need to reformat the namenode and recreate the datanode storage. First remove the old data on both machines
    master ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred
    slave ~/hadoop/hadoop-hadoop]$ rm -rf dfs mapred

    Then reformat the name node
    master ~/hadoop] $ ./bin/hadoop namenode -format

  • Start the cluster.
    [master ~/hadoop]$ ./bin/start-dfs.sh
    starting namenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-namenode-master.out
    master: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-master.out
    slave: starting datanode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-datanode-slave.out
    master: starting secondarynamenode, logging to /home/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-master.out
    Check the processes running on the master node
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5319 Jps
    5117 DataNode
    4995 NameNode
    Check the processes running on slave node
    [slave ~/hadoop]$ jps
    22256 Jps
    22203 DataNode
    Check the datanode log on the slave for errors: <HADOOP_INSTALL_DIR>/logs/hadoop-hadoop-datanode-slave.log
  • Now start the mapreduce daemons:
    [master ~/hadoop]$ ./bin/start-mapred.sh
    starting jobtracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-jobtracker-master.out
    slave: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-slave.out
    master: starting tasktracker, logging to /home/hadoop/bin/../logs/hadoop-hadoop-tasktracker-master.out
    Check the processes on master
    [master ~/hadoop]$ jps
    5249 SecondaryNameNode
    5117 DataNode
    5725 TaskTracker
    5598 JobTracker
    5853 Jps
    4995 NameNode
    And the processes on the slave
    [slave ~/hadoop]$ jps
    22735 TaskTracker
    22856 Jps
    22413 DataNode
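Since the same three settings have to be written on every machine, the configuration step in the list above can be scripted. This is a rough sketch only: the helper name site_xml and the CONF_DIR variable are assumptions made for this post, and the generated files contain just the properties passed in (no descriptions).

```shell
#!/bin/sh
# site_xml (hypothetical helper): prints a complete hadoop site file
# containing the given name/value pairs.
# Usage: site_xml name1 value1 [name2 value2 ...]
site_xml() {
    printf '<?xml version="1.0"?>\n<configuration>\n'
    while [ "$#" -ge 2 ]; do
        printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
        shift 2
    done
    printf '</configuration>\n'
}

# Example for the two-node cluster described above.
# CONF_DIR is an assumption; point it at <HADOOP_INSTALL_DIR>/conf.
#   site_xml fs.default.name hdfs://master:54310 > "$CONF_DIR/core-site.xml"
#   site_xml dfs.replication 2                   > "$CONF_DIR/hdfs-site.xml"
#   site_xml mapred.job.tracker master:54311     > "$CONF_DIR/mapred-site.xml"
```

Note that redirecting into an existing file overwrites it, so keep a copy if you have other properties set there.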

To shut down the hadoop cluster, run the following on master

[master ~/hadoop]$ ./bin/stop-mapred.sh # to stop mapreduce daemons
[master ~/hadoop]$ ./bin/stop-dfs.sh # to stop the hdfs daemons

Now, let’s put some files on hdfs and see if we can run some programs

Get the following files onto your local filesystem, in some test directory on master

[master ~/test]$ wget http://www.gutenberg.org/files/20417/20417-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
[master ~/test]$ wget http://www.gutenberg.org/files/4300/4300-8.txt
[master ~/test]$ wget http://www.gutenberg.org/dirs/etext99/advsh12.txt

Copy the files into the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -copyFromLocal ../test/ test

Check the files on the hdfs file system

[master ~/hadoop]$ ./bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:37 /user/hadoop/test
[master ~/hadoop]$ ./bin/hadoop dfs -ls test
Found 4 items
-rw-r--r-- 2 hadoop supergroup 674425 2008-10-20 12:37 /user/hadoop/test/20417-8.txt
-rw-r--r-- 2 hadoop supergroup 1573048 2008-10-20 12:37 /user/hadoop/test/4300-8.txt
-rw-r--r-- 2 hadoop supergroup 1423808 2008-10-20 12:37 /user/hadoop/test/7ldvc10.txt
-rw-r--r-- 2 hadoop supergroup 590093 2008-10-20 12:37 /user/hadoop/test/advsh12.txt

Now let’s run some test programs. Let’s run the wordcount example and collect the output in the test-op directory.

[master ~/hadoop]$ ./bin/hadoop jar hadoop-0.18.1-examples.jar wordcount test test-op
08/10/20 12:49:45 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.FileInputFormat: Total input paths to process : 4
08/10/20 12:49:46 INFO mapred.JobClient: Running job: job_200810201146_0003
08/10/20 12:49:47 INFO mapred.JobClient: map 0% reduce 0%
08/10/20 12:49:52 INFO mapred.JobClient: map 50% reduce 0%
08/10/20 12:49:56 INFO mapred.JobClient: map 100% reduce 0%
08/10/20 12:50:02 INFO mapred.JobClient: map 100% reduce 16%
08/10/20 12:50:05 INFO mapred.JobClient: Job complete: job_200810201146_0003
08/10/20 12:50:05 INFO mapred.JobClient: Counters: 16
08/10/20 12:50:05 INFO mapred.JobClient: File Systems
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes read=4261374
08/10/20 12:50:05 INFO mapred.JobClient: HDFS bytes written=949192
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes read=2044286
08/10/20 12:50:05 INFO mapred.JobClient: Local bytes written=3757882
08/10/20 12:50:05 INFO mapred.JobClient: Job Counters
08/10/20 12:50:05 INFO mapred.JobClient: Launched reduce tasks=1
08/10/20 12:50:05 INFO mapred.JobClient: Launched map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Data-local map tasks=4
08/10/20 12:50:05 INFO mapred.JobClient: Map-Reduce Framework
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input groups=88307
08/10/20 12:50:05 INFO mapred.JobClient: Combine output records=205890
08/10/20 12:50:05 INFO mapred.JobClient: Map input records=90949
08/10/20 12:50:05 INFO mapred.JobClient: Reduce output records=88307
08/10/20 12:50:05 INFO mapred.JobClient: Map output bytes=7077676
08/10/20 12:50:05 INFO mapred.JobClient: Map input bytes=4261374
08/10/20 12:50:05 INFO mapred.JobClient: Combine input records=853602
08/10/20 12:50:05 INFO mapred.JobClient: Map output records=736019
08/10/20 12:50:05 INFO mapred.JobClient: Reduce input records=88307
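A quick sanity check on these counters: "HDFS bytes read" and "Map input bytes" (4261374) equal the total size of the four input files listed earlier (674425 + 1573048 + 1423808 + 590093 = 4261374). A small sketch to compute that sum from a "dfs -ls" listing follows. The function name sum_ls_sizes is made up here, and it assumes the size is the 5th column, as in the listing above.

```shell
#!/bin/sh
# sum_ls_sizes (hypothetical helper): sums the size column (5th field)
# of the file entries in `hadoop dfs -ls` output read from stdin.
# Directory entries (mode starting with "d") and the "Found N items"
# header are skipped because their first field does not start with "-".
sum_ls_sizes() {
    awk '$1 ~ /^-/ { total += $5 } END { print total + 0 }'
}

# Usage:
#   ./bin/hadoop dfs -ls test | sum_ls_sizes
```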
Now, let’s check the output.

[master ~/hadoop]$ ./bin/hadoop dfs -ls test-op
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:45 /user/hadoop/test-op/_logs
-rw-r--r-- 2 hadoop supergroup 949192 2008-10-20 12:46 /user/hadoop/test-op/part-00000
[master ~/hadoop]$ ./bin/hadoop dfs -copyToLocal test-op/part-00000 test-op-part-00000
[master ~/hadoop]$ head test-op-part-00000
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
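To get a feel for what the wordcount job computes, the same kind of count can be produced locally for a small file with standard unix tools. This is only a sketch: local_wordcount is a made-up name, and I am assuming (not verifying) that splitting on whitespace and sorting in byte order is close enough to what the example job does for a comparison to be meaningful.

```shell
#!/bin/sh
# local_wordcount (hypothetical helper): counts whitespace-separated
# tokens in the given files (or stdin) and prints "token<TAB>count",
# sorted by token in byte order.
local_wordcount() {
    cat "$@" |
        tr -s '[:space:]' '\n' |   # one token per line
        grep -v '^$' |             # drop a possible leading empty line
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }' # uniq -c prints the count first; swap the columns
}

# Usage:
#   local_wordcount ~/test/advsh12.txt | head
```

As the output above shows, punctuation stays attached to the words, so "'Ah!' and "'Ah, are counted as different tokens.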

That’s it… We have a live setup of hadoop running on two machines…

install latest vlc on ubuntu 8.04

So, you have got ubuntu and you are still playing movies on some 0.8.6 version of vlc. No matter how many times you do an “apt-get update”, vlc does not upgrade. The way to fix this is:

a) sudo vim /etc/apt/sources.list

b) Add the line "deb http://ppa.launchpad.net/c-korn/ubuntu hardy main" (without the quotes) and save the file

c) sudo apt-get update

d) sudo apt-get install vlc

Now when you type in vlc, you will see the latest vlc player popping up…
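Steps a) to d) can also be done without opening an editor. The sketch below is just one way to do it: add_line_once is a made-up helper that appends a line to a file only if that exact line is not already there, so running it twice does not duplicate the entry.

```shell
#!/bin/sh
# add_line_once (hypothetical helper): append line $2 to file $1
# unless the file already contains exactly that line.
add_line_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (run as root, since /etc/apt/sources.list is not writable otherwise):
#   add_line_once /etc/apt/sources.list "deb http://ppa.launchpad.net/c-korn/ubuntu hardy main"
#   apt-get update
#   apt-get install vlc
```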

how law and order works in india…

It had been a week since we had got vegetables from the market. For the past 3-4 days we had been surviving only on dishes made out of potatoes and onions. So, on a fine Monday evening, while we were driving back home from the office, we thought that we should stop at the roadside vegetable market and get some veggies. So, we stopped on the opposite side of the road from the market, got out and went to get the veggies. There were lots of cars parked there and, of course, it is a very busy road with cars and other vehicles moving up and down. Also, tons of rickshaw-walas were waiting not 100 meters from where I had parked my car.

We are really quick shoppers, so in 15 minutes we had got supplies for almost a week and decided to head back. When I came back, I saw that the window of the right-hand back door was not there. Well, it was there, but it was in pieces, and my precious laptop bag with all its contents was missing. It took some time for the fact to sink in that my bag was missing. I looked left and right for the f*** who had stolen my bag but, of course, no one could be found. And then I looked at the people nearby. I saw a man sitting on a chair 10 meters from my car. I ran to him and asked if he had seen anything. Of course, he had seen nothing. The rickshaw-walas had not noticed anything either, but were sympathetic to my situation. “ka jamana aa gaya hai, seesha tod kar bagwa le kar chala gaya” (what times we live in, someone broke the glass and went off with the bag). That is how they expressed their grief.

In India, when things are stolen from you, you have to give up all hope of getting them back. People generally turn blind and deaf when they feel that something wrong is happening around them. The reason behind this might be that the probability of the culprit getting caught and facing a sentence is very low (maybe 0.1 %). People generally avoid reporting crimes, because the police are totally un-cooperative. The police try “NOT” to find lost things. I can’t figure out if it is their laziness or lack of IQ. It might be both. In foreign countries, say the U.S., you dial 911 and within 5-10 minutes you have cops at your door, trying to help you out. I have not tried dialing 100 here, but I don’t think that the police would be at the place in less than an hour.

Well, let’s move ahead with the story. So, I called up one of my friends and told him what had happened. He told me that he would be home in an hour and then we could go and maybe report the crime, that is, if it was required to claim the insurance. The main things I had lost were my official laptop, possession letter (I had specially taken it out that day to get it photocopied), the RC of my bike, bank and credit-card statements and an almost new cheque book. I went home and made a complete list of the things I had lost and the passwords that needed changing (I had stored some passwords on my laptop).

An hour later, my friend and I wrote an application in English and went to the nearby police “chowki” to get the report registered. The first reaction of the policeman sitting there was “why is the application in English?”, so we wrote it again in Hindi. Imagine policemen unable to read English. What would happen if a foreign tourist got robbed? Would he even be able to make these guys understand what had happened? When we handed over the freshly written application to him, he simply put it on the table and went out to get his superior officer. The superior officer read the application and asked us to take him to the place where the incident had happened. I had used my brain a little and had safely put away my car and taken my bike, mainly because I was not sure if they would ask me to leave the car with them, or ask for more money on seeing a long car and thinking that I am a “rich” man.

So, we rode after them on the motorcycle and went to the vegetable market. I showed him the place where I had parked the car from a distance (I don’t know why he did not go up to the place) and told him that there had been rickshaws and a man sitting nearby. He was quick to ask whether I suspected the man to be the thief. How could I judge? Should he not question the person and find out for himself? Anyway, next he started blaming me for parking my car on the road where there was no parking. But there is no parking space nearby, I thought. And we came back to his “chowki”. He kept the application and told us to check on the status the next morning. We asked him if he would give us an FIR. But the reply was “No”: “mai FIR nahi likhta” (I don’t write FIRs).

I knew this would happen, and that it would be difficult to get an FIR out of these guys without giving them some “donation” in return. But I had expected them to ask upfront for a “donation”. No such thing happened, so we waited. We waited for almost 15 minutes, but nothing happened. Then the “daroga” got angry because we were waiting for his response and said “Ab laptop le kar hi jaogay kya? Dekh lo kahi yehi pada ho to le jana.” (Will you leave only once you have the laptop? Look around, if it is lying here somewhere, take it.) We did not know what to reply, so we simply went home.

The next day, I went to the police station with 2 of my friends and again repeated the complete story to at least 2 police officers. One of them told us to bring our car to show the damage. So, we went and got the car. He then sent some junior havaldar to check whether the window was broken (that is, whether we were telling the truth or not). Then we were asked to consult the SO (the officer in charge of the police station). Again we repeated the story to him, and showed him the car from a distance. So, he told his junior officers to accept the “application” [still no FIR]. The junior officer simply accepted the application, stamped it and drew a vertical line (which I believe was his signature). He did not read the contents, nor did he check what language it was written in.

After accepting the application, when we asked for a complaint no., he stopped looking at us and started shifting papers from one box to the same box. I think he was trying to ignore us, I don’t know why. So, we again went to the SO and told him that we wanted a complaint no. He redirected us to another officer, to whom we again repeated the whole story and showed the application. He simply said “yeah to angreji mein hai. Isko received kisne kia?” (this is in English, how did this get accepted?). How was I supposed to know the answer to this question? He took us back to the junior officer who had accepted it and asked him why he had accepted an application in English. Well, the application was returned to us and we were asked to rewrite it in Hindi and submit it again.

So, we wrote it in Hindi and submitted it again. This time the junior officer read it and then stamped it. It was during that time that we came to know that the junior officer had studied only up to the 8th standard and did not know any English. If this is the level of education that a policeman has, then what can we expect of them? We asked almost everyone when we could get an FIR or a complaint no., and the response was “kal” (tomorrow). Someone even said that we might get it in 2-3 hours, if we were ready to wait.

When I came to the office that day, lots of people came to me and shared their experiences of getting an FIR. Some of them had spent months to get one. In one very sad case, a person had spent 200/- for the FIR and even after that there was some mistake in the writing, so he had not yet been able to claim his insurance.

For the next 2 days, I just went and asked whether the FIR was ready, and the general response was “kal”, mainly due to work pressure. Well, if the police are so busy writing FIRs, who would do the investigation and catch the criminals? I think they have a tough job to do, trying to write down so many FIRs instead of catching the criminals and reducing the crime rate. Everybody advised me to pay them to get the FIR. But when I offered, they would say that it was unnecessary and still make me come the next day.

Finally, on the 3rd day, dad was here, so he went with me and talked them into writing the FIR. We were again sent to the SO, to whom we again repeated the complete story and reminded him of the talk we had had earlier. He again read the application that we had submitted and asked us to get a photocopy of the bill of the laptop. We rushed to the office and got a photocopy of the bill. This time the SO was generous and told us to get the FIR written. When we again went to the junior officer, he asked us to wait and then confirmed with the SO whether he should write the FIR. Again we were told to come back after 2 hours. This time dad hinted at the offer and I think the junior officer got it.

We were again asked to write a fresh application and change the dates accordingly. After some pestering, the junior officer finally started writing the FIR. After around 30 minutes the FIR was ready and we got a copy. After I came out, I asked dad whether he had given them the “donation”, because I had not seen him giving it to them. Dad told me that it was given when he shook hands with them for the final “thank you”.

The point here is that after being the victim of a crime, you have to chase the police to prove that you are a victim. Forget about getting back your stolen stuff. You have to give them some money to get the complaint registered. It is like “Please sir (beg with folded hands, a 100/- rs note between the hands), I have suffered a great loss, please write my complaint”. How can you expect justice from these guys? I still remember the detective serials that I used to watch in my childhood. The actual scenario is worse than that. Policemen not only overlook valuable clues but don’t even want to find the criminal. If your car is picked up and left in the police station for a week, all you would get back would be the outer body and the seats. The steering, engine etc. would all go missing. And if you enquire about it, all they would say is that it was brought in like that.

The SO had his own style (tashan). He would keep on chewing paan etc. and keep on spitting out the reddish stuff. Specially for him there was a huge bucket on his left side, so that he could keep on spitting until the bucket was full, and then it would be taken away.

Welcome to India…

road rage

I have driven for hours in Delhi. I have driven from Noida to Gurgaon in very heavy traffic, which took me 3 hours. Sometimes it becomes really troublesome when you have a nature’s call and cannot do anything about it. At most of the places where jams are frequent, you can see vendors selling water and things to eat. But there are no public toilets nearby.

Nowadays, all cycle-walas and rickshaw-walas think that they are Shahrukh Khan. They ride in the middle of the road and try to overtake your car, cutting in front of you as if they are invincible. People on bikes also believe that they are GOD. They think they can jump red lights without getting into any accidents and ride in the middle of the road at a constant speed of 30 kmph, turning a deaf ear to your honking and not allowing you to overtake. And the best part is when you see circus-like acts on the road, with 3 people on a cycle or 3 people on a small bike. I sometimes feel pity for the overloaded vehicles, and even wonder whether the people who designed the bike’s engine had taken such situations into consideration.

This attitude does lead to some frustration. After driving for an hour to cover a 20 minute distance, trying to protect these people in spite of their own carelessness, you tend to get exhausted. And you also have to protect yourself from the buses which keep zooming past from left and right, both in the wrong direction and the right direction. What amazes me about these buses is how they drive by at full speed with only 2 inches to spare between your vehicle and the bus, and still miss you. Aren’t the bus drivers extremely talented?

And finally, cows & dogs on the road. They just stand there, confused about where to go and what to do. The municipal corporation does not try to remove them from the road. Who cares, people simply go around them.

And of course, the most famous of all these are the tempos. Yup, the green colored tempos which have CNG written on their back and still throw out tons and tons of black smoke. They are heavily overloaded and keep on ferrying people between points A & B. They don’t go beyond 20 kmph and they always move in the middle of the road, so that all the traffic stays behind them. They always stop in the middle of the road, just at a crossing, so that they can create jams and make people acknowledge their importance by honking.

When people encounter these situations, and are unable to arrive at a compromise, they simply blow up. Why should they care for people who do not care for themselves?

If I had a genie, I would wish that it taught the people on the road to have more road sense.