Creating a Virtualized Hadoop Lab

Over the past few years I have let my home lab dwindle a little. I have been busy at work, and for the most part I was able to prototype what I needed on my laptop, thanks to the generous amount of RAM in the MacBook Pro. Still, I did not have the kind of permanent setup I really wanted. I know plenty of people who go to the trouble of setting up racks to build their own clusters at home, but since I only need a functional lab environment and don't want to waste the power, cooling, or space in my house, I turned to virtualization. While I would be the first one in the room to start babbling about how Hadoop is not supposed to be virtualized in production, it is perfectly appropriate for development. I wanted a place to test and use a variety of Hadoop virtual machines:

Vendor       Distribution
-----------  -------------------------------------
Hortonworks  Hortonworks Data Platform
Cloudera     Quick Start VMs
MapR         MapR Distribution for Apache™ Hadoop®
Oracle       Big Data Lite
IBM          BigInsights


* If you are feeling froggy, here is a full list of Hadoop vendors.

So I dusted off an old workstation from my attic: a Dell Precision T3400 that I used a few moons ago for the same purpose. A couple of years ago this system was fine for running a handful of minimal Linux instances, but to make it useful today it needed some work. I obviously had to upgrade Ubuntu to 14.04, as it was still on some version in the 12 range. I won't bother with the details of those gymnastics, as I believe the Ubuntu community has that covered.

While I did take a look at VirtualBox and VMware Player, I wanted something open source and sans GUI. I realize there are command-line options for both VirtualBox and VMware, but in the end QEMU/KVM with libvirt fit the bill as the most open source and command-line way to go. For those new to virtualization and in need of a GUI, one of the other solutions might be a better fit. It's left as an exercise for the reader to get QEMU and libvirt installed on your OS. An important point I worked through was creating a bridged adapter on the host machine. I have only one network card installed and wanted my hosted machines on my internal network. In short, you create a network adapter that the virtualization system can use on top of a single physical adapter. The server can still use the regular IP of the original adapter, but the virtual hosts can now act as if they are fully on the local network. Since this system won't be leaving my lab, this is a perfect solution. If you want something mobile on your laptop, you should consider an internal or host-only network setup. Make sure you reboot after making the following change.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto br0
iface br0 inet static
# a static interface needs addressing -- substitute your own network's values
address 192.168.1.50
netmask 255.255.255.0
gateway 192.168.1.1
bridge_ports eth0
bridge_fd 9
bridge_hello 2
bridge_maxage 12
bridge_stp off

Although QEMU supports a variety of disk formats natively, I decided to convert the images I collected for my Hadoop playground into qcow2, the native format for QEMU. I collected a variety of "sandbox" images from a number of Hadoop and Big Data vendors. Most come in OVA format, which is really just a tarball of the vmdk disk image and an ovf file describing it. To convert, you simply extract the vmdk file:

tar -xvf /path/to/file/Hortonworks_Sandbox_2.1_vmware.ova

and convert the resulting vmdk file:

qemu-img convert -O qcow2 Hortonworks_Sandbox_2.1-disk1.vmdk /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2
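It's worth sanity-checking the result before booting anything; a quick look with qemu-img, using the output path from above:

```shell
# Inspect the converted image: confirms a valid qcow2 was written
# and reports the virtual disk size.
qemu-img info /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2
```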

Have more than one vmdk, like the MapR sandbox? No problem:

qemu-img convert -O qcow2 ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk1.vmdk ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk2.vmdk ../MapR-Sandbox-For-Hadoop-3.1.1_VM.qcow2
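If you are hoarding several sandboxes, the extract-and-convert dance is easy to script. A rough sketch, assuming every OVA sits in the current directory and that file names contain no spaces (both assumptions on my part):

```shell
#!/bin/sh
# Extract each downloaded OVA and convert its vmdk disk(s) to qcow2.
# The vmdk names inside an archive do not always match the OVA's own
# name, so read them out of the archive itself with tar -tf.
for ova in ./*.ova; do
    [ -e "$ova" ] || continue              # no OVAs here, nothing to do
    tar -xf "$ova"                         # unpack the vmdk(s) and ovf
    vmdks=$(tar -tf "$ova" | grep '\.vmdk$')
    # qemu-img accepts multiple source vmdks and concatenates them,
    # which handles multi-disk sandboxes like MapR in one shot.
    qemu-img convert -O qcow2 $vmdks "${ova%.ova}.qcow2"
done
```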


Using the images is quick and easy:

virt-install --connect qemu:///system \
    -n HWXSandbox21 \
    --ram 2048 \
    --vcpus=2 \
    --os-type=linux --os-variant=rhel6 \
    --disk path=/home/user/virtual_machines/Hortonworks_Sandbox_2.1-disk1-VMware.qcow2,device=disk,bus=virtio,format=qcow2 \
    --vnc --noautoconsole --import

If you go the GUI route, you could use virt-manager at this point to get a console and manage the machine. Of course, in the interest of saving RAM and pure stubbornness, I use the command line. First find the list of installed systems, then open a console to that instance.

virsh list --all
virsh console guestvmname


While this will get you a console, you might not see all the console output you would see on a physical monitor. For CentOS you need to create a serial interface in the OS (ttyS0) and instruct the OS to use that new interface. From this point one should be able to log in, find the IP address and be off to the races. With the new serial interface in place you will also see the normal boot-up action if you reboot.
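For reference, here is roughly what that looks like inside a CentOS 6 guest; the console speed and device names are the usual defaults, but treat this as a sketch rather than gospel:

```shell
# Inside the guest: route boot and console output to the virtual
# serial port so `virsh console` shows it.
# 1. Append a serial console to the kernel line in /boot/grub/grub.conf:
#      kernel /vmlinuz-... ro root=... console=tty0 console=ttyS0,115200
# 2. Allow root logins on the new interface:
echo "ttyS0" >> /etc/securetty
```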

The real saving here is memory. Turning off the X server and all the unnecessary OS services frees up memory for running the various sandboxes. This should let you use Linux on a machine with 8 to 16GB of RAM effectively for development.
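On each guest, that trimming amounts to booting to the console instead of a desktop and switching off services the sandbox does not need. A sketch for a CentOS 6 style guest (the service names here are just examples; run `chkconfig --list` to pick your own candidates):

```shell
# Boot to runlevel 3 (console) instead of 5 (graphical desktop).
sed -i 's/^id:5:initdefault:/id:3:initdefault:/' /etc/inittab
# Switch off example services the sandbox does not need at boot.
chkconfig bluetooth off
chkconfig cups off
```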

The next step will be to automate the installation of base operating systems via a PXE boot environment, followed by the installation of a true multi-node virtualized Hadoop cluster. That I will leave for another post.

Look Ma…I'm Famous, I'm in Linux Journal.

But seriously, it is kind of cool that I finally made it into Linux Journal. It has been one of my long-time reads and subscriptions, and I have learned lots of cool tech from what I consider the original Linux magazine. Anywho, take a look at my article in the latest edition of Linux Journal (the HPC edition, too!). It's about YARN and how scheduling in HPC systems compares to scheduling in YARN.
