Implementing the Tool Interface in MapReduce

I was banging around with MapReduce the other day and surfing the web when I came across this post on implementing the Tool interface in your MapReduce driver program. Most of the introductory examples show a static main method which, as the author describes, doesn't let you set the configuration dynamically (i.e., you cannot use -D at the command line to pass options to the configuration object). For fun I took Word Count and refactored it using this suggestion. I thought it might be good to share this with folks. I have posted the full code to github and display it below as well.
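This is a condensed sketch of that refactoring (the mapper and reducer are the stock word count classes; see the github version for the exact code). The driver extends Configured and implements Tool, and ToolRunner takes care of the generic options:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner has already
        // populated with any -D options passed on the command line
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options (-D, -files, -libjars, ...)
        // and hands only the remaining arguments to run()
        System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
    }
}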

Using this method you can now pass options to the configuration object via the command line using -D. This is a handy addition to any MapReduce program.
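For example, with the driver above you can change the reducer count at submit time without touching code (the jar name and paths here are placeholders):

hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=4 /user/hdfs/input /user/hdfs/output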

Creating a Multinode Hadoop Sandbox

One of the great things about all the Hadoop vendors is that they have made it very easy for people to obtain and start using their technology rapidly. I will say that I think Cloudera has done the best job by providing cloud-based access to their distribution via Cloudera Live. All vendors seem to have a VMware- and VirtualBox-based sandbox/trial image. Having worked with Hortonworks, theirs is the distribution I have the most experience with, so I thought a quick initial blog post would be helpful.

While one could simply do this installation from scratch following the package instructions, it's also possible to short-circuit much of the setup as well as take advantage of the scaled-down configuration work already put into the virtual machine provided by Hortonworks. In short, the idea is to use the VM as a single master node and simply add data nodes to this master. Running this way provides an easy way to install and expand an initial Hadoop system up to about 10 nodes. As the system grows you will need to add RAM, not only to the virtual host but to the Hadoop daemons as they scale. A full script is available here. Below is a description of the process.

The general steps include:

1. The Sandbox 

Download and install the Hortonworks Sandbox as your head node in your virtualization system of choice. The sandbox tends to be produced prior to the latest major release (compare the output of yum list hadoop\*). Make sure you have first enabled Ambari by running the script in root's home directory, then reboot.

To make sure you are using the very latest stable release, and that the Ambari server and agent daemons have matching versions, it is easiest to upgrade both. This includes the following:
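Something along these lines on the sandbox (a sketch; it assumes the Ambari yum repository already present on the image):

ambari-server stop
ambari-agent stop
yum -y upgrade ambari-server ambari-agent
ambari-server upgrade     # migrate the Ambari metadata to the new version
ambari-server start
ambari-agent start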

2. The Nodes  

Install 1-N CentOS 6.5 nodes as slaves and prep them as worker nodes. These can be default installs of the OS but need to be on the same network as the Ambari server. The prep can be facilitated via pdsh (which requires passwordless SSH) or, better yet, by creating one "data node" image via a PXE boot environment or a snapshot of the virtual machine, to quickly replicate 1-N nodes with these changes.

If you want to use SSH, you can do the following from the head node to quickly enable passwordless SSH:
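Something like this works (node names are placeholders for your slaves):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa      # skip if a key already exists
for host in node1 node2 node3; do
  ssh-copy-id root@$host                      # appends the key to each node's authorized_keys
done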

You then want to make the following changes to your slave nodes. Again, this could easily be done via pdsh, by using pdcp to push a script with the following content to each node and then executing it.
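The full script linked above covers everything; a minimal sketch of the usual CentOS 6 prep items looks like this (the transparent huge pages path is RedHat-specific):

#!/bin/bash
# prep-datanode.sh -- minimal worker node prep for Ambari (CentOS 6 sketch)

# stop the firewall so the Ambari agent and Hadoop daemons can communicate
service iptables stop
chkconfig iptables off

# disable SELinux now and across reboots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# keep cluster clocks in sync
yum -y install ntp
chkconfig ntpd on
service ntpd start

# disable transparent huge pages, which degrade Hadoop performance
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled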

Push this file to the slave nodes and run it. This does NOT need to be done on the sandbox/head node.
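With pdsh installed on the head node that is a two-liner (node names again placeholders, prep-datanode.sh being the sketch above):

pdcp -w node1,node2,node3 prep-datanode.sh /tmp/
pdsh -w node1,node2,node3 'bash /tmp/prep-datanode.sh'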

3. Configure Services

Run the Ambari "add nodes" GUI installer to add data nodes. Be sure to select "manual registration" and follow the on-screen prompts to install components. I recommend installing everything on all nodes and simply turning the services off and on as needed. Also installing the client binaries on all nodes helps to make sure you can do debugging from any node in the cluster.

[Screenshot: Ambari Add Nodes dialog]

4. Turn off select services as required. 

There should now be 1-N data nodes/slaves attached to your Ambari/Sandbox head node. Here are some suggested changes.  
1. Turn off large services you aren't using, like HBase, Storm, and Falcon. This will help save RAM.  
[Screenshot: Ambari Services]
2. Decommission the DataNode on this machine! No, a head node is not a data node. If you run jobs here you will have problems.  
3. HDFS replication factor – This is set to 1 in the sandbox because there is only one data node. If you only have 1-3 data nodes then triple replication doesn't make sense. I suggest you keep it at 1 until you have more than 3 data nodes, at a bare minimum. If you have the resources, just start with 10 data nodes (that's why it's called Big Data). If not, stick with a replication factor of 1, but be aware the system will function as a prototype and won't provide the natural safeguards or parallelism of normal HDFS.  
4. Increase RAM on the head node – At a bare minimum Ambari requires 4096MB. If you plan to run the sandbox as a head node, consider increasing from this minimum. Also consider giving running services room to breathe by increasing the RAM allocated in Ambari for each service; a sketch of the relevant properties follows this list. Here is a great review and script for guesstimating how to scale services for MapReduce and YARN.  
5. NFS – To make your life easier you might want to enable NFS on a data node or two.
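For reference, these are the knobs that sizing exercise ends up setting. The numbers below are only illustrative, for a worker with 8GB given over to YARN; use the script above for your own hardware:

yarn.nodemanager.resource.memory-mb = 8192    # total RAM YARN may allocate per node
yarn.scheduler.minimum-allocation-mb = 1024   # smallest container YARN will grant
mapreduce.map.memory.mb = 1024                # container size for each map task
mapreduce.reduce.memory.mb = 2048             # container size for each reduce task
mapreduce.map.java.opts = -Xmx819m            # JVM heap, roughly 80% of the map container
mapreduce.reduce.java.opts = -Xmx1638m        # JVM heap, roughly 80% of the reduce container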

Creating a Virtualized Hadoop Lab

Over the past few years I have let my home lab dwindle a little. I have been very busy at work, and for the most part I was able to prototype what I needed on my laptop given the generous amount of RAM on the MacBook Pro. That said, I was still not able to have the type of permanent setup I really wanted. I know lots of guys who go to the trouble of setting up racks to create their own clusters at home. Given that I really only need a functional lab environment and don't want to waste the power, cooling, or space in my home, I turned to virtualization. While I would be the first one in the room to start babbling on about how Hadoop is not supposed to be virtualized in production, it is appropriate for development. I wanted a place to test and use a variety of Hadoop virtual machines:

Vendor      | Distro                                 | URL
Hortonworks | Hortonworks Data Platform              | http://hortonworks.com/products/hortonworks-sandbox/
Cloudera    | Quick Start VMs                        | http://go.cloudera.com/vm-download
MapR        | MapR Distribution for Apache™ Hadoop®  | https://www.mapr.com/products/mapr-sandbox-hadoop
Oracle      | Big Data Lite                          | link
IBM         | Big Insights                           | http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html


* If you are feeling froggy, here is a full list of Hadoop Vendors.

So I dusted off an old workstation I had in my attic, a Dell Precision T3400 that I used a few moons ago for the same reason. A couple of years ago this system was fine for running a handful of minimal Linux instances; to make it useful today it needed some work. I obviously had to upgrade Ubuntu to 14.04, as it was still on some version in the 12 range. I won't bother with the details of these gymnastics, as I believe the Ubuntu community has this covered.

While I did take a look at VirtualBox and VMware Player, I wanted to use something open source but also sans GUI. I realize there are command-line options for both VirtualBox and VMware, but in the end QEMU/KVM with libvirt fit the bill as the most open source and command-line way to go. For those new to virtualization and in need of a GUI, one of the other solutions might be a better fit. It's left as an exercise for the reader to get QEMU and libvirt installed on your OS. An important point I worked through was creating a bridged adapter on the host machine. I only have one network card installed and wanted my hosted machines on my internal network. In short, you are creating a network adapter that the virtualization system can use on top of a single physical adapter. The server can still use the regular IP of the original adapter, but now virtual hosts can act as if they are fully on the local network. Since this system won't be leaving my lab, this is a perfect solution. If you want something mobile, such as on a laptop, you should consider an internal or host-only network setup. Make sure you reboot after changing the following.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

# the physical NIC is enslaved to the bridge and gets no address of its own
auto eth0
iface eth0 inet manual

auto br0
iface br0 inet static
address 192.168.55.3
netmask 255.255.255.0
network 192.168.55.0
broadcast 192.168.55.255
gateway 192.168.55.1
dns-nameservers 192.168.55.1
bridge_ports eth0
bridge_fd 9
bridge_hello 2
bridge_maxage 12
bridge_stp off

Although QEMU supports a variety of disk formats natively, I decided to convert the images I collected for my Hadoop playground into qcow2, the native format for QEMU. I collected a variety of "sandbox" images from a number of Hadoop and Big Data vendors. Most come in OVA format, which is really just a tarball of the vmdk disk image and an ovf file describing it. To convert, you simply extract the vmdk file:

tar -xvf /path/to/file/Hortonworks_Sandbox_2.1_vmware.ova

and convert the resulting vmdk file:

qemu-img convert -O qcow2 Hortonworks_Sandbox_2.1-disk1.vmdk /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2

Have more than one vmdk, like the MapR sandbox? No problem; qemu-img concatenates multiple source images into a single output:

qemu-img convert -O qcow2 ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk1.vmdk ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk2.vmdk ../MapR-Sandbox-For-Hadoop-3.1.1_VM.qcow2


The use of the images is quick and easy:

virt-install --connect qemu:///system -n HWXSandbox21 --ram 2048 --os-type=linux --os-variant=rhel6 --disk path=/home/user/virtual_machines/Hortonworks_Sandbox_2.1-disk1-VMware.qcow2,device=disk,bus=virtio,format=qcow2 --vcpus=2 --vnc --noautoconsole --import

If you go the GUI route, you could use virt-manager at this point to get a console and manage the machine. Of course, in the interest of saving RAM and out of pure stubbornness, I use the command line. First find the list of installed systems, then open a console to that instance.

virsh list --all
virsh console guestvmname


While this will get you a console, you might not see all the output you would see on a monitor. For CentOS you need to create a serial interface in the OS (ttyS0) and instruct the OS to use that new interface. From this point one should be able to log in, find the IP address, and be off to the races. With the new serial interface in use you will see the normal boot-up action if you reboot.
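On a CentOS 6 guest that comes down to roughly the following (a sketch; it assumes the stock grub.conf layout):

# spawn a login getty on the serial port
echo "S0:2345:respawn:/sbin/agetty ttyS0 115200" >> /etc/inittab
# allow root logins over the serial console
echo "ttyS0" >> /etc/securetty
# mirror kernel boot messages to the serial port
sed -i 's/\(kernel .*\)/\1 console=ttyS0,115200/' /boot/grub/grub.conf
reboot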

The real saving here is memory. Turning off the X server and all the unnecessary OS services saves memory for running the various sandboxes. This should allow you to use Linux and a machine with 8 to 16GB of RAM effectively for development.

The next step will be to automate the installation of the base operating systems via a PXE boot environment, followed by the installation of a true multinode virtualized Hadoop cluster. That I will leave for another post.