Posted by Ridvan Döngelci, Friday, October 11, 2013

Apache Hadoop is a framework for processing large amounts of data in a parallel fashion. The Hadoop framework relies heavily on the map and reduce functions of functional programming languages. A user only defines the map and reduce functions; all other operations, like distributing data and work over the network, re-running failed jobs, and collecting results, are handled automatically by Hadoop.
Hadoop first applies the user-defined map function to key-value pairs, and the results of the mapping are sorted and distributed over the nodes according to their key values. Each node applies the user-defined reduce function to each key and its grouped values, and commonly writes the results to a file on the Hadoop Distributed File System (HDFS).
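
As a rough illustration of the model only, the classic word count can be simulated locally with a shell pipeline: the "map" stage emits (word, 1) pairs, the sort stage groups the pairs by key, and the "reduce" stage sums the counts per key. Hadoop runs the same stages distributed over many nodes.

# simulate map -> shuffle/sort -> reduce for word count, locally:
# the "map" stage emits (word, 1) pairs, sort groups pairs by key,
# and the "reduce" stage sums the counts for each key
echo "apache hadoop map reduce hadoop" | tr ' ' '\n' \
    | sed 's/$/\t1/' \
    | sort \
    | awk -F'\t' '{sum[$1] += $2} END {for (w in sum) print w, sum[w]}'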

Setting Up Hadoop on a Virtual Machine

The setup uses VirtualBox, Hadoop 1.2.1, and Ubuntu 12.04.3 Server.

VirtualBox and Ubuntu Setup

1. Download VirtualBox 4.2.18 or a later version that suits your operating system.

2. Install VirtualBox with the desired settings.

3. Download the Ubuntu 12.04.3 Server edition image to use as the guest operating system that will run Hadoop.
           From: http://www.ubuntu.com/download/server

4. Open VirtualBox and click the 'New' option to create a new virtual machine.

5. Choose 'Type' as 'Linux' and 'Version' as 'Ubuntu'.

6. Proceed with selecting the desired options, like RAM size and disk size.

7. Select the virtual machine you just created and click the 'Start' option.

8. VirtualBox will ask for a disk image to boot; select the Ubuntu 12.04.3 Server edition image you downloaded in step 3. If the 'FATAL: No bootable medium found! System halted.' message appears, click the 'Devices' menu -> 'CD/DVD Devices' -> 'Choose a virtual CD/DVD disk file...', then browse to and select the Ubuntu image you downloaded. After selecting the Ubuntu image, click 'Machine' and then 'Reset'.

9. Set up Ubuntu in the virtual machine with the desired options.

Setup Hadoop

10. Set up port forwarding for the virtual machine so the guest's SSH port is reachable from the host: in VirtualBox, open the machine's 'Settings' -> 'Network' -> 'Adapter 1' (NAT) -> 'Port Forwarding' and add a rule that forwards host port 2222 to guest port 22.
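
If you prefer the command line, the same rule can be added with VBoxManage while the virtual machine is powered off (the machine name "Ubuntu-Hadoop" below is an assumption; use whatever name you gave your VM in step 4):

# forward host port 2222 to guest port 22 on the NAT adapter
VBoxManage modifyvm "Ubuntu-Hadoop" --natpf1 "guestssh,tcp,,2222,,22"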

11. After installing the virtual machine, run it and log in with the username and password you picked during setup. Go to your home directory with the "cd ~/" command.

12. Install the SSH client and server, used for connecting to the virtual machine with ssh and scp, with the following commands.
After this step you may connect to the virtual machine from the host with "ssh -p 2222 <vm-username>@localhost" (note the lowercase -p; scp uses an uppercase -P for the port).
    
# Install ssh client and server
sudo apt-get install ssh
sudo apt-get install openssh-server

13. Add ssh keys as trusted as follows.

# generate a passwordless ssh key and add it to the authorized keys
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
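
As a quick sanity check, a passwordless login to localhost should now succeed (you may have to accept the host key the first time):

# should log in and print the message without asking for a password
ssh localhost 'echo passwordless ssh works'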

14. Install the Java Runtime Environment and Java Development Kit on the Ubuntu server by running the following commands.

# install jre and jdk (java)
cd ~/
wget https://github.com/flexiondotorg/oab-java6/raw/0.3.0/oab-java.sh -O oab-java.sh
chmod +x oab-java.sh
sudo ./oab-java.sh
sudo apt-get install sun-java6-jre
sudo apt-get install sun-java6-jdk
rm oab-java.sh
rm -f oab-java.sh.log
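
You can confirm the installation succeeded and that the JVM directory used for JAVA_HOME in steps 17 and 18 exists:

# verify java is installed and the expected JVM directory exists
java -version
ls /usr/lib/jvm/java-6-sun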

15. Rsync is necessary for Hadoop. If you don't have it in your Ubuntu setup, run the following command.

# rsync is necessary for hadoop
sudo apt-get install rsync

16. Install Hadoop using the following commands.

# install hadoop
cd ~/
wget http://www.nic.funet.fi/pub/mirrors/apache.org/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xvf hadoop-1.2.1.tar.gz
rm hadoop-1.2.1.tar.gz
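
To confirm the archive unpacked correctly, you can ask the Hadoop binary for its version:

# verify the hadoop distribution is in place
~/hadoop-1.2.1/bin/hadoop version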

17. Next you need to configure Hadoop for pseudo-distributed mode. You may run the following commands, or edit the files with a text editor so that they contain the content shown.

# configure hadoop for pseudo-distributed mode
mv ~/hadoop-1.2.1/conf/core-site.xml ~/hadoop-1.2.1/conf/core-site-backup.xml
echo "<configuration>
     <property> 
         <name>fs.default.name</name> 
         <value>hdfs://localhost:9000</value> 
     </property>
</configuration>" > ~/hadoop-1.2.1/conf/core-site.xml 

mv ~/hadoop-1.2.1/conf/hdfs-site.xml ~/hadoop-1.2.1/conf/hdfs-site-backup.xml
echo "<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>" > ~/hadoop-1.2.1/conf/hdfs-site.xml

mv ~/hadoop-1.2.1/conf/mapred-site.xml ~/hadoop-1.2.1/conf/mapred-site-backup.xml
echo "<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>" > ~/hadoop-1.2.1/conf/mapred-site.xml

echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/hadoop-1.2.1/conf/hadoop-env.sh 

18. Set the environment variables necessary for the Hadoop setup. You may run the following commands, or just export the variables, but in that case they will only be valid for the current session.

#set java home and hadoop home in bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
echo 'export HADOOP_HOME=~/hadoop-1.2.1' >> ~/.bashrc
echo 'export PATH=~/hadoop-1.2.1/bin:$PATH' >> ~/.bashrc

source ~/.bashrc

19. Now you may start and stop Hadoop using the following scripts, respectively.

start-all.sh
stop-all.sh
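
Note that on a brand-new setup, HDFS must be formatted once before the first start (the start_commands script used in step 21 may already take care of this; run the command below only if you start Hadoop manually):

# one-time HDFS format; this erases any existing HDFS data
hadoop namenode -format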


Alternatively, you may download and run the following script on your Ubuntu setup to perform all of the operations listed above.

cd ~/
wget http://www.cs.hut.fi/~dongelr1/hadoopscript.sh
chmod +x hadoopscript.sh
./hadoopscript.sh

WordCount Example

20. Run the following commands to load the WordCount example. If you have already run hadoopscript.sh, you may skip this step.

#get scripts and wordcount example
cd ~/
wget http://www.cs.hut.fi/~dongelr1/WordCount.tar
tar -xvf WordCount.tar
rm WordCount.tar
cd WordCount
chmod +x start_commands
chmod +x run_commands

21. Run start_commands to start Hadoop as follows.

cd ~/WordCount
./start_commands

22. Check http://localhost:50070/dfshealth.jsp to see whether any DataNode is live. If no node is alive, run "stop-all.sh", remove everything in tmp with "rm -r /tmp/*", and redo step 21.
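
The same check can be done from the command line inside the VM:

# report configured capacity and the list of live/dead datanodes
hadoop dfsadmin -report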

23. You may run "run_commands" to run the WordCount example; you can examine its commands with "cat run_commands". The run_commands script basically makes an input directory in the Hadoop Distributed File System and puts file1.txt and file2.txt into that directory. Afterwards, it compiles WordCount.java into the wordcount_classes folder and creates a wordcount.jar file from the compiled classes. Finally, it executes a Hadoop job with wordcount.jar that counts the words in the input/ directory and writes the results to the output/ directory. A sketch of these commands is shown after the invocation below.

cd ~/WordCount
./run_commands
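
For reference, the script's steps are likely along the following lines, reconstructed from the description in step 23 (the driver class name org.myorg.WordCount is an assumption taken from the standard Hadoop tutorial; check "cat run_commands" for the exact commands):

# sketch of what run_commands does (exact names may differ)
cd ~/WordCount
hadoop fs -mkdir input                        # create input dir in HDFS
hadoop fs -put file1.txt file2.txt input      # upload the sample files
mkdir -p wordcount_classes
javac -classpath ~/hadoop-1.2.1/hadoop-core-1.2.1.jar \
    -d wordcount_classes WordCount.java       # compile the example
jar -cvf wordcount.jar -C wordcount_classes . # package the classes
hadoop jar wordcount.jar org.myorg.WordCount input output
hadoop fs -cat 'output/part-*'                # print the word counts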
