Creating a Multinode Hadoop Sandbox

One of the great things about all the Hadoop vendors is that they have made it very easy for people to obtain and start using their technology rapidly. I will say that I think Cloudera has done the best job by providing cloud-based access to their distribution via Cloudera Live. All vendors seem to have a VMware- and VirtualBox-based sandbox/trial image. Having worked with Hortonworks, it is the distribution I have the most experience with, so I thought a quick initial blog post would be helpful.

While one could simply do this installation from scratch following the package instructions, it’s also possible to short-circuit much of the setup and take advantage of the scaled-down configuration work already put into the virtual machine provided by Hortonworks. In short, the idea is to use the VM as a single master node and simply add data nodes to this master. Running this way provides an easy way to install and expand an initial Hadoop system up to about 10 nodes. As the system grows you will need to add RAM not only to the virtual host but also to the Hadoop daemons. A full script is available here. Below is a description of the process.

The general steps include:

1. The Sandbox 

Download and install the Hortonworks Sandbox as your head node in your virtualization system of choice. The sandbox tends to be produced prior to the latest major release (compare the output of yum list hadoop\*). Make sure you have first enabled Ambari by running the script in root’s home directory, then reboot.
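For example, inside the sandbox VM you can compare what shipped against what the repos now offer (a quick sketch):

# what the sandbox image shipped with
yum list installed hadoop\*
# every version the configured repos can provide
yum list available hadoop\* --showduplicates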

To make sure you are using the very latest stable release and that the Ambari server and agent daemons have matching versions, it is easiest to upgrade both.
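The upgrade looks roughly like this (a sketch, assuming the newer Ambari repo file has already been dropped into /etc/yum.repos.d; the exact repo URL depends on the release you target):

ambari-server stop
ambari-agent stop
yum clean all
yum upgrade -y ambari-server ambari-agent
ambari-server upgrade    # migrates the server database to the new version
ambari-server start
ambari-agent start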

2. The Nodes  

Install 1-N CentOS 6.5 nodes as slaves and prep them as worker nodes. These can be default installs of the OS but need to be on the same network as the Ambari server. The prep can be facilitated via pdsh (which requires passwordless SSH), or better yet by creating one “data node” image via a PXE boot environment or a snapshot of the virtual machine so you can quickly replicate 1-N nodes with these changes.

If you want to use SSH you can do this from the head node to quickly enable passwordless SSH:
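Something like this (a sketch; the hostnames are placeholders for your own slave nodes):

# generate a key on the head node (no passphrase) and push it to each slave
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in slave1 slave2 slave3; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub root@$host
done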

You then want to make the following changes to your slave nodes. Again, this could easily be done via pdsh by pushing a script to each node with pdcp and executing it there.
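Something along these lines works as a starting point (a minimal sketch of typical CentOS 6 worker prep; adjust for your environment):

#!/bin/bash
# keep clocks in sync for the Hadoop daemons
yum install -y ntp
chkconfig ntpd on && service ntpd start
# stop the firewall so the Ambari and Hadoop ports are reachable
service iptables stop && chkconfig iptables off
# disable SELinux now and on the next reboot
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# disable transparent huge pages, which hurt Hadoop performance
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled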

Push this file to slave nodes and run it. This does NOT need to be done on the sandbox/headnode.

3. Configure Services 

Run the Ambari “add nodes” GUI installer to add data nodes. Be sure to select “manual registration” and follow the on-screen prompts to install components. I recommend installing everything on all nodes and simply turning the services off and on as needed. Installing the client binaries on all nodes also helps make sure you can do debugging from any node in the cluster.
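“Manual registration” just means the Ambari agent gets installed and pointed at the server by hand on each new node before you run the wizard. A minimal sketch (assuming the Ambari repo file is already on the node; AMBARIHOSTNAME is a placeholder):

yum install -y ambari-agent
# point the agent at the Ambari server
sed -i 's/^hostname=.*/hostname=AMBARIHOSTNAME/' /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent start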

Ambari Add Nodes Dialog

4. Turn off select services as required. 

There should now be 1-N data nodes/slaves attached to your Ambari/Sandbox head node. Here are some suggested changes.  
1. Turn off large services you aren’t using like HBase, Storm, Falcon. This will help save RAM.  
Ambari Services
2. Decommission the DataNode on this machine! No, a head node is not a data node; if you run jobs here you will have problems.  
3. HDFS replication factor – This is set to 1 in the sandbox because there is only one data node. If you only have 1-3 data nodes then triple replication doesn’t make sense. I suggest you use 1 until you get over 3 data nodes at a bare minimum. If you have the resources, just start with 10 data nodes (that’s why it’s called Big Data). If not, stick with a replication factor of 1, but be aware this will function as a prototype system and won’t provide the natural safeguards or parallelism of normal HDFS (see the sketch after this list).  
4. Increase RAM on the head node – At a bare minimum Ambari requires 4096MB. If you plan to run the sandbox as a head node, consider increasing from this minimum. Also consider giving running services room to breathe by increasing the RAM allocated in Ambari for each service. Here is a great review and script for guesstimating how to scale services for MapReduce and YARN.  
5. NFS – to make your life easier you might want to enable NFS on a data node or two.
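One note on the replication point above: lowering dfs.replication (under the HDFS service configs in Ambari) only affects files written afterward; anything already in HDFS keeps its old factor until you reset it. A quick sketch:

hdfs dfs -setrep -w 1 /user    # re-replicate an existing path at factor 1
hdfs dfs -ls /user             # the second column shows each file's replication factor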

Creating a Virtualized Hadoop Lab

Over the past few years I have let my home lab dwindle a little. I have been very busy at work, and for the most part I was able to prototype what I needed on my laptop given the generous amount of RAM on the MacBook Pro. That said, I was still not able to have the type of permanent setup I really wanted. I know lots of guys who go to the trouble of setting up racks to create their own clusters at home. Given that I really only need a functional lab environment and don’t want to waste the power, cooling, or space in my home, I turned to virtualization. While I would be the first one in the room to start babbling on about how Hadoop is not supposed to be virtualized in production, it is appropriate for development. I wanted a place to test and use a variety of Hadoop virtual machines:

Vendor        Distro
Hortonworks   Hortonworks Data Platform
Cloudera      Quick Start VMs
MapR          MapR Distribution for Apache™ Hadoop®
Oracle        Big Data Lite
IBM           Big Insights


* If you are feeling froggy here is a full list of Hadoop Vendors.

So I dusted off an old workstation I had in my attic from a couple of years ago. This is a Dell Precision T3400 workstation that I used a few moons ago for the same reason. A couple of years ago, running a handful of minimal Linux instances, this system was fine. To make it useful today it needed some work. I obviously had to upgrade Ubuntu to 14.04 as it was still on some version in the 12 range. I won’t bother with the details of these gymnastics as I believe the Ubuntu community has this covered.

While I did take a look at VirtualBox and VMware Player, I wanted to use something open source but also sans GUI. I realize there are command-line options for both VirtualBox and VMware, but in the end QEMU/KVM with libvirt fit the bill as the most open source and command-line way to go. For those new to virtualization and in need of a GUI, one of the other solutions might be a better fit. It’s left as an exercise for the reader to get QEMU and libvirt installed on your OS. An important point I worked through was creating a bridged adapter on the host machine. I only have one network card installed and wanted my hosted machines on my internal network. In short, you are creating a network adapter that the virtualization system can use on top of a single physical adapter. The server can still use the regular IP of the original adapter, but now virtual hosts can act as if they are fully on the local network. Since this system won’t be leaving my lab this is a perfect solution. If you want something mobile, such as on your laptop, you should consider an internal or host-only network setup. Make sure you reboot after changing the following.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto br0
iface br0 inet static
# set the address, netmask and gateway for your LAN here
bridge_ports eth0
bridge_fd 9
bridge_hello 2
bridge_maxage 12
bridge_stp off

Although QEMU supports a variety of disk formats natively, I decided to convert the images I collected for my Hadoop playground into qcow2, the native format for QEMU. I collected a variety of “sandbox” images from a number of Hadoop and Big Data vendors. Most come in OVA format, which is really just a tarball of the vmdk file and an ovf file describing the disk image. To convert, you simply extract the vmdk file: 

tar -xvf /path/to/file/Hortonworks_Sandbox_2.1_vmware.ova

and convert the resulting vmdk file:

qemu-img convert -O qcow2 Hortonworks_Sandbox_2.1-disk1.vmdk /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2

Have more than 1 vmdk like the MapR sandbox? No problem: 

qemu-img convert -O qcow2 ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk1.vmdk ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk2.vmdk ../MapR-Sandbox-For-Hadoop-3.1.1_VM.qcow2


The use of the images is quick and easy:

virt-install --connect qemu:///system -n HWXSandbox21 --ram 2048 --os-type=linux --os-variant=rhel6 --disk path=/home/user/virtual_machines/Hortonworks_Sandbox_2.1-disk1-VMware.qcow2,device=disk,bus=virtio,format=qcow2 --vcpus=2 --vnc --noautoconsole --import

If you were to go the GUI route, you could use virt-manager at this point to get a console and manage the machine. Of course, in the interest of saving RAM and pure stubbornness, I use the command line. First find the list of installed systems and then open a console to that instance. 

virsh list --all
virsh console guestvmname


While this will get you a console, you might not see all the output you would normally see on a monitor. For CentOS you need to create a serial interface in the OS (ttyS0) and instruct the OS to use that new interface. From this point one should be able to log in, find the IP address, and be off to the races. With the new serial interface in place you will also see the normal boot-up output if you reboot.
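A sketch of what that can look like inside a CentOS 6 guest (details vary by release):

# 1) send boot and console output to the serial port: append to the kernel line
#    in /boot/grub/grub.conf:
#      console=tty0 console=ttyS0,115200
# 2) allow root logins over the serial line
echo "ttyS0" >> /etc/securetty
# 3) reboot; CentOS 6 spawns a getty on ttyS0 automatically once the kernel
#    console is pointed at it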

The real saving here is memory. Turning off the X server and all the unnecessary OS services frees memory for running the various sandboxes. This should allow you to use Linux and a machine with 8 to 16GB of RAM effectively for development.
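On an Ubuntu 14.04 host, for example, something like this keeps the display manager (and X) from starting (a sketch; service names may differ on your install):

sudo service lightdm stop
echo "manual" | sudo tee /etc/init/lightdm.override   # upstart override so it stays off at boot
service --status-all                                  # review what else starts at boot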

The next step will be to automate the installation of the base operating systems via a PXE boot environment, followed by installation of a true multinode virtualized Hadoop cluster. That I will leave for another post.

Updating Ambari Stack Repos via REST API

If you use Ambari to deploy Hadoop you may have had occasion to need to change the repo used after you installed. At the time of this article, version 1.5.1 of Ambari requires you to do this via the API as it is not exposed in Ambari Web. The HDP repo file placed in /etc/yum.repos.d is generated by Ambari under certain conditions. In any case this is a good review of how to use the REST interface to manipulate Ambari. The basic approach uses the curl command to GET and PUT items against the Ambari API layer. Want some basic information about your cluster? Try this:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/clusters


You can check your existing repo definition via this command:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6


Then you can set the repo base URL to whatever you need via the PUT command below. This is helpful in situations where, for example, the internal repo location has changed post install. After this call you can double-check your work by rerunning the command above or via the Ambari web interface under Admin > Clusters > Repositories.

curl -H "X-Requested-By: ambari" -X PUT -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6 -d '{"Repositories": {"base_url": "http://REPOHOSTNAMEHERE/HDP/centos6/2.x/updates/", "verify_base_url": false}}'

This works for all the different repos for each stack. You can play with the GET command above to explore the different options available for each installed stack.
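For example, dropping the final path segment from the GET above should list every repository defined for that stack and OS (same placeholders as before; a sketch):

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories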

Look Ma…I’m Famous, I’m in Linux Journal.

But seriously, it is kind of cool that I finally made it into Linux Journal. It has been one of my long-time reads and subscriptions, and I have learned lots of cool tech from what I consider the original Linux magazine. Anywho, take a look at my article in the latest edition of Linux Journal (the HPC edition, too!). It’s about YARN and how it compares to traditional HPC scheduling systems.


Connecting Talend to Hadoop


If you have not used Talend, it’s a great tool for doing ETL visually. The drag-and-drop WYSIWYG interface (which is basically Eclipse) allows you to program in Java without knowing you are programming (see the “Code” tab for your job). I have played a lot with Talend as a tool. Since there is a free version for Big Data, it’s easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once running, you will probably want to be able to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.

While simply dragging a component to the designer pane of the job (the center of the screen) and editing its properties will work, the concept of a “context” in Talend is very useful.


You can take a look at the help file on contexts within Talend, but I don’t think there is a Big Data specific version, so I thought I would make some notes on this. On the far left, in the Repository view, you will see a submenu called Contexts that you can control-click (or right-click) to add a context. After naming it, you can add what amount to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.


This includes the JobTracker, NameNode, Templeton, HBase, Hive, Thrift, and user name, as well as a data directory and a jars directory. You will need to copy a number of jars to your local system from Hadoop and register them with Talend. For the Talend version I am using, Open Studio for Big Data 5.3.1r104014, I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore, and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various components’ properties as you use them. Your list may be slightly different, but the point is to simply find a local place to dump the jars and then reference that location using a Talend context rather than constantly using a full path. You will also need to make sure you import your context into your job. This amounts to clicking on the “Contexts” tab in the job properties and clicking on the clipboard icon to import the contexts.


To use the context in the component I have captured a screenshot showing the HDFS component using a standard double quote and append syntax to access the global variable:

"hdfs://" + context.NN_HOST + ":8020/"


Contexts are a really useful trick for making the components of your job transportable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to run on a new cluster by simply altering the context.

YARN is coming, and it will change the daemons used in Hadoop and naturally the context information, but again you can have more than one context as well as more than one cluster. Contexts are a great way to rapidly abstract cluster-specific details from your Talend jobs. I will follow up soon with a YARN-based post connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop; I will likely present a selection of them in subsequent blog posts.


Arun talks Yarn

In case you weren’t there, here is a link to Arun Murthy, one of the founders of Hortonworks, talking about YARN at Hadoop Summit 2013. This is a topic near and dear to my heart, as I have spent a good deal of my career working in HPC under the banners of workload management and resource negotiation across nodes.

Don’t over think Hadoop.

I was watching TV yesterday and flipped past the film Moneyball on one of the movie channels. I had seen this movie before and have heard all about its relationship to BI and Big Data. I guess for me the central theme of that movie is disruption as it relates to innovation.

The disruptive power of not really needing high-end hardware to build a supercomputer has the market scrambling to force Hadoop into a model that fits many years of coaching by industry giants. You have to have super high-end hardware with many layers of backup, redundancy and failsafes. You have to come in on the weekend and neglect your family to “save the day” for some silly website powering someone else’s critical functionality. Stuck in a rut of selling high-end nodes, the industry selling into modern data-warehouse-powered businesses finds the disposability of slave nodes, combined with the resiliency that large numbers of slaves bring, disruptive. Not needing fiber or even 10GbE means networks are smaller, less expensive and closer to disposable. No need for virtualization, you say? How can this be? Virtualization is good for everything, isn’t it? It’s faster, more dense and cost efficient, right? Just ask your vendor. They will tell you all about it. Don’t even get me started on the eternal battle of share everything versus share nothing. I have argued for share nothing for many years in classic HPC to no avail. How else can you sell massive network or cluster based storage? Query the market and you will find no end to the perversion of the original intent of Hadoop. Changes to the file system, the replication level, placement of data into traditional databases and placement of MapReduce over the same. Hogwash. Buy nodes. Lots of them. Cheap ones with zero features for redundancy (no, you don’t need two power supplies or 18 NIC cards, and you shouldn’t be paying more than $5k MAX). If they break beyond repair, put them in the dumpster (or maybe donate them to a good cause). Don’t over think Hadoop. Start using it and get educated. It will be disruptive and cause people to fear change (including in your own company), but at this stage, much like “cloud” was a few years ago, if you don’t have a strategy for Hadoop in place you are going to be sitting on your couch in October watching competitor brand X win the World Series of your field.