Apache Hadoop in VirtualBox
Posted by: Ridvan Döngelci, Friday, October 11, 2013
Apache Hadoop is a framework for processing large amounts of data in parallel. Hadoop relies heavily on the map and reduce functions familiar from functional programming languages. The user defines only the map and reduce functions; all other operations, such as distributing data and work over the network, re-running failed jobs, and collecting results, are handled automatically by Hadoop.
Hadoop first applies the user-defined map function to the input key-value pairs. The results of the mapping are sorted and distributed over the nodes according to their keys. Each node then applies the user-defined reduce function to each key and the values collected for it, and typically writes the results to a file on the Hadoop Distributed File System (HDFS).
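As a rough local analogy (this is not how Hadoop itself is implemented), the map, sort, and reduce phases can be mimicked with a Unix pipeline on a single machine. Here tr plays the role of map by emitting one word per line, sort brings equal keys together, and uniq -c acts as reduce by counting each group. The function name local_wordcount is made up for this sketch.

```shell
#!/usr/bin/env bash
# Illustrative only: a single-machine analogy of map -> sort -> reduce.
# local_wordcount is a hypothetical helper name, not part of Hadoop.
#
# Pipeline stages, in order:
#   tr    "map": split the input into one word (key) per line
#   grep  drop empty lines caused by leading whitespace
#   sort  "shuffle/sort": bring identical keys next to each other
#   uniq  "reduce": count the occurrences of each key
#   awk   print the result as "word<TAB>count"
local_wordcount() {
  tr -s '[:space:]' '\n' |
    grep -v '^$' |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# Small demonstration on two lines of text.
printf 'hello world\nhello hadoop\n' | local_wordcount
```

In Hadoop the same three phases run across many machines, and the sort step is what guarantees that all values for one key reach the same reducer.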
Setting Up Hadoop on a Virtual Machine
This setup uses VirtualBox, Hadoop 1.2.1, and Ubuntu 12.04.3 Server.
VirtualBox and Ubuntu Setup
1. Download VirtualBox 4.2.18 or a later version that suits your operating system.
2. Install VirtualBox with the settings you prefer.
3. Download the Ubuntu 12.04.3 Server edition image, which will be the guest operating system that runs Hadoop.
4. Open VirtualBox and click 'New' to create a new virtual machine.
5. Set 'Type' to 'Linux' and 'Version' to 'Ubuntu', as shown in the picture below.
6. Continue by selecting the options you want, such as RAM size and disk size.
7. Select the virtual machine you just created and click 'Start'.
8. VirtualBox will ask for a disk image to boot from; select the Ubuntu 12.04.3 Server edition image you downloaded in step 3. If the message 'FATAL: No bootable medium found! System halted.' appears, open the 'Devices' menu, choose 'CD/DVD Devices' -> 'Choose a virtual CD/DVD disk file...', then browse to and select the Ubuntu image you downloaded. After selecting the image, click 'Machine' and then 'Reset'.
9. Install Ubuntu in the virtual machine with the options you prefer.
Hadoop Setup
10. Set up port forwarding for the virtual machine as shown here:
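The same kind of rules can also be created from the host's command line with VBoxManage. The sketch below is only an assumption about which rules are needed: it forwards host port 2222 to the guest's ssh port 22 (the port used in step 12) and host port 50070 to the guest's port 50070 (the HDFS web page checked in step 22). The VM name hadoop-vm and the rule names are hypothetical. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/usr/bin/env bash
# Sketch: NAT port-forwarding rules for the guest, to be run on the host.
# VM_NAME and the rule names (guestssh, hdfsui) are hypothetical.
# Note: "modifyvm" only works while the virtual machine is powered off.
VM_NAME="${VM_NAME:-hadoop-vm}"

# Print each command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

forward_ports() {
  # Rule format: name,protocol,host ip,host port,guest ip,guest port
  run VBoxManage modifyvm "$VM_NAME" --natpf1 "guestssh,tcp,,2222,,22"
  run VBoxManage modifyvm "$VM_NAME" --natpf1 "hdfsui,tcp,,50070,,50070"
}

forward_ports
```

Leaving the host and guest IP fields empty makes the rule apply to all host interfaces and to the guest's NAT address.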
11. After the installation finishes, start the virtual machine and log in with the username and password you chose during setup. Go to your home directory with the "cd ~/" command.
12. Install the ssh client and server with the following commands so that you can connect to the virtual machine with ssh and scp.
After this step you can connect to the virtual machine from an ssh client on the host with "ssh -p 2222 localhost".
# Install ssh client and server
sudo apt-get install ssh
sudo apt-get install openssh-server
13. Generate an ssh key and add it to the authorized keys, so that ssh logins to localhost work without a password, as follows.
# add ssh key to trusted servers
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
14. Install the Java Runtime Environment and Java Development Kit on Ubuntu Server by running the following commands.
# install jre and jdk (java)
cd ~/
wget https://github.com/flexiondotorg/oab-java6/raw/0.3.0/oab-java.sh -O oab-java.sh
chmod +x oab-java.sh
sudo ./oab-java.sh
sudo apt-get install sun-java6-jre
sudo apt-get install sun-java6-jdk
rm oab-java.sh
rm -f oab-java.sh.log
15. Hadoop requires rsync. If it is not already installed on your Ubuntu system, run the following command.
# rsync is necessary for hadoop
sudo apt-get install rsync
16. Install Hadoop using the following commands.
# install hadoop
cd ~/
wget http://www.nic.funet.fi/pub/mirrors/apache.org/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xvf hadoop-1.2.1.tar.gz
rm hadoop-1.2.1.tar.gz
17. Next, configure Hadoop for pseudo-distributed mode. You may either run the following commands or edit the files with a text editor so that they contain the content shown.
# configure hadoop for pseudo-distributed mode
mv ~/hadoop-1.2.1/conf/core-site.xml ~/hadoop-1.2.1/conf/core-site-backup.xml
echo "<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>" > ~/hadoop-1.2.1/conf/core-site.xml
mv ~/hadoop-1.2.1/conf/hdfs-site.xml ~/hadoop-1.2.1/conf/hdfs-site-backup.xml
echo "<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>" > ~/hadoop-1.2.1/conf/hdfs-site.xml
mv ~/hadoop-1.2.1/conf/mapred-site.xml ~/hadoop-1.2.1/conf/mapred-site-backup.xml
echo "<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>" > ~/hadoop-1.2.1/conf/mapred-site.xml
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/hadoop-1.2.1/conf/hadoop-env.sh
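To confirm that a configuration file ended up with the intended value, a small lookup helper can be handy. The sketch below is a simple line-based check, not a real XML parser, and the name hadoop_conf_value is made up for illustration. It is demonstrated on a scratch copy of the core-site.xml content from this step rather than on the real file.

```shell
#!/usr/bin/env bash
# hadoop_conf_value is a hypothetical helper, not a Hadoop command.
# It finds the line containing <name>NAME</name> and prints the text of the
# next <value> element. It is a line-based lookup, not an XML parser.
hadoop_conf_value() {
  local name="$1" file="$2"
  awk -v want="$name" '
    index($0, "<name>" want "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }' "$file"
}

# Demonstrate on a scratch file holding the core-site.xml content shown above.
demo_conf="$(mktemp)"
cat > "$demo_conf" <<'EOF'
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
hadoop_conf_value fs.default.name "$demo_conf"
```

Against the real files, the call would look like hadoop_conf_value fs.default.name ~/hadoop-1.2.1/conf/core-site.xml, which should print hdfs://localhost:9000 if the step succeeded.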
18. Set the environment variables that Hadoop needs. You may run the commands below, or simply export the variables in your shell, in which case they are valid only for the current session.
#set java home and hadoop home in bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
echo 'export HADOOP_HOME=$HOME/hadoop-1.2.1' >> ~/.bashrc
echo 'export PATH=~/hadoop-1.2.1/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
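One thing to keep in mind is that running the echo commands above more than once appends the same lines to ~/.bashrc again. A possible way around this, sketched below, is to append a line only when it is not already present. The helper name append_once is made up, and the demonstration writes to a scratch file rather than the real ~/.bashrc.

```shell
#!/usr/bin/env bash
# append_once is a hypothetical helper: it appends LINE to FILE only if the
# file does not already contain exactly that line.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file instead of the real ~/.bashrc.
rc_file="$(mktemp)"
append_once 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' "$rc_file"
append_once 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' "$rc_file"  # no-op the second time
cat "$rc_file"
```

To use it for real, pass ~/.bashrc as the second argument.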
19. You can now start and stop Hadoop with the following scripts, respectively.
start-all.sh
stop-all.sh
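After start-all.sh, one way to see whether everything came up is the JDK's jps tool, which lists running Java processes. In a Hadoop 1.x pseudo-distributed setup, five daemons are expected: NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker. The sketch below reads jps output and prints the daemons that are missing; the helper name missing_daemons and the sample listing are made up for illustration.

```shell
#!/usr/bin/env bash
# The five daemons that start-all.sh launches in Hadoop 1.x.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# missing_daemons (hypothetical helper) reads `jps` output on stdin and
# prints the name of every expected daemon that is not listed.
missing_daemons() {
  local listing daemon
  listing="$(cat)"
  for daemon in $EXPECTED_DAEMONS; do
    # jps prints "<pid> <name>" per line; compare the name field exactly.
    printf '%s\n' "$listing" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

# Demonstration with a made-up listing in which only two daemons are running.
printf '101 NameNode\n102 SecondaryNameNode\n103 Jps\n' | missing_daemons

# On the virtual machine, the real check would be:
#   jps | missing_daemons
```

An empty result means that all five daemons are running.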
Alternatively, you may download and run the following script on your Ubuntu setup to perform all of the operations listed above.
cd ~/
wget http://www.cs.hut.fi/~dongelr1/hadoopscript.sh
chmod +x hadoopscript.sh
./hadoopscript.sh
WordCount Example
20. Run the following commands to download the WordCount example. If you ran hadoopscript.sh, you may skip this step.
#get scripts and wordcount example
cd ~/
wget http://www.cs.hut.fi/~dongelr1/WordCount.tar
tar -xvf WordCount.tar
rm WordCount.tar
cd WordCount
chmod +x start_commands
chmod +x run_commands
21. Run start_commands to start Hadoop as follows.
cd ~/WordCount
./start_commands
22. Open http://localhost:50070/dfshealth.jsp to check whether any data node is live. If no node is live, run "stop-all.sh", remove everything in /tmp with "rm -r /tmp/*", and repeat step 21.
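The same check can be made from the command line with "hadoop dfsadmin -report". The sketch below assumes that the report contains a line of the form "Datanodes available: 1 (1 total, 0 dead)", as Hadoop 1.x prints it; if your version formats the report differently, the pattern needs adjusting. The helper name live_datanodes is made up.

```shell
#!/usr/bin/env bash
# live_datanodes (hypothetical helper) reads a dfsadmin report on stdin and
# prints the number from the "Datanodes available: N (...)" line, or 0 if
# no such line is present. The line format is an assumption (Hadoop 1.x).
live_datanodes() {
  awk -F'[: ]+' '/^Datanodes available:/ { n = $3 } END { print n + 0 }'
}

# On the virtual machine, with Hadoop running:
#   hadoop dfsadmin -report | live_datanodes
```

A result of 0 corresponds to the "no live node" case described in this step.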
23. Run "run_commands" to execute the WordCount example; you can inspect the commands it contains with "cat run_commands". The run_commands script creates an input directory in the Hadoop Distributed File System and puts file1.txt and file2.txt into that directory. It then compiles WordCount.java into the wordcount_classes folder and builds wordcount.jar from the compiled classes. Finally, it runs a Hadoop job with wordcount.jar that counts the words in the input/ directory and writes the results to the output/ directory.
cd ~/WordCount
./run_commands
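For readers who want to see the sequence without opening the script, the description in step 23 corresponds roughly to the commands sketched below. This is a guess at the kind of commands involved, not the actual content of run_commands: the location of hadoop-core-1.2.1.jar, the main class name WordCount, and the final "cat" of the results are assumptions. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the steps described for run_commands; the real script may differ.
# HADOOP_DIR, the main class name and the last command are assumptions.
HADOOP_DIR="${HADOOP_DIR:-$HOME/hadoop-1.2.1}"

# Print each command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

wordcount_steps() {
  # Create the input directory in HDFS and upload the two text files.
  run hadoop fs -mkdir input
  run hadoop fs -put file1.txt file2.txt input
  # Compile WordCount.java against the Hadoop core jar and package it.
  run mkdir -p wordcount_classes
  run javac -classpath "$HADOOP_DIR/hadoop-core-1.2.1.jar" -d wordcount_classes WordCount.java
  run jar -cvf wordcount.jar -C wordcount_classes/ .
  # Run the job: read from input/, write to output/.
  run hadoop jar wordcount.jar WordCount input output
  # Not part of the description above: print the job's result files.
  run hadoop fs -cat 'output/part-*'
}

wordcount_steps
```

Hadoop refuses to start a job whose output directory already exists, so the output directory has to be removed from HDFS before the job is run a second time.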