Creating a Multinode Hadoop Sandbox

One of the great things about the Hadoop vendors is that they have all made it very easy to obtain and start using their technology rapidly. I think Cloudera has done the best job by providing cloud-based access to their distribution via Cloudera Live. All vendors seem to have a VMware- and VirtualBox-based sandbox/trial image. Having worked with Hortonworks, I have the most experience with their distribution, so I thought a quick initial blog post would be helpful.

While one could simply do this installation from scratch following the package instructions, it’s also possible to short-circuit much of the setup and take advantage of the scaled-down configuration work already put into the virtual machine provided by Hortonworks. In short, the idea is to use the VM as a single master node and simply add data nodes to it. Running this way provides an easy way to install and expand an initial Hadoop system up to about 10 nodes. As the system grows you will need to add RAM, not only to the virtual host but also to the Hadoop daemons running on it. A full script is available here. Below is a description of the process.

The general steps include:

1. The Sandbox 

Download and install the Hortonworks Sandbox as your head node in your virtualization system of choice. The sandbox tends to be produced prior to the latest major release (compare the yum list hadoop\* output against the current release). Make sure you have first enabled Ambari by running the script in root’s home directory, then reboot.
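
Starting from a fresh sandbox image, enabling Ambari looks roughly like the following. This is a sketch only; the script name varies between sandbox versions (start_ambari.sh here is a stand-in), so check root’s home directory for the actual file.

```bash
# On the sandbox VM, as root. The script name below is a stand-in --
# look in /root for the Ambari setup script shipped with your sandbox version.
ls /root/*.sh
/root/start_ambari.sh   # hypothetical name; run whichever script enables Ambari
reboot
```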

In order to make sure you are using the very latest stable release, and that the Ambari server and agent daemons have matching versions, upgrading is easiest. This includes the following:
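
This is a minimal sketch, assuming the yum repositories already configured on the sandbox point at the release you want (check the Hortonworks documentation for the matching repo file):

```bash
# Stop the Ambari daemons before upgrading
ambari-server stop
ambari-agent stop

# Pull the newer packages and migrate the server's configuration/schema
yum clean all
yum upgrade -y ambari-server ambari-agent
ambari-server upgrade

# Bring everything back up on matching versions
ambari-server start
ambari-agent start
```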

2. The Nodes  

Install 1-N CentOS 6.5 nodes as slaves and prep them as worker nodes. These can be default installs of the OS, but they need to be on the same network as the Ambari server. The prep work can be pushed out via pdsh (which requires passwordless SSH), or better yet you can create one “data node” image via a PXE boot environment or a snapshot of the virtual machine and quickly replicate it 1-N times with these changes already baked in.

If you want to go the SSH route, you can run the following from the head node to quickly enable passwordless SSH:
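
This is a sketch; node1 through node3 stand in for your actual slave hostnames:

```bash
# Generate a key on the head node if one doesn't already exist
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to each slave (prompts once per node for root's password)
for host in node1 node2 node3; do
  ssh-copy-id root@$host
done
```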

You then want to make the following changes on each slave node. Again, this could easily be done via pdsh: pdcp a script with the content below to each node and execute it.
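
The exact contents will depend on your environment, but a minimal prep script for CentOS 6 workers typically covers the standard Ambari prerequisites: firewall, SELinux, and time sync.

```bash
#!/bin/bash
# prep-node.sh -- minimal worker-node prep sketch for CentOS 6

# Ambari and the Hadoop daemons need open ports between cluster nodes
service iptables stop
chkconfig iptables off

# Disable SELinux immediately and across reboots
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Keep clocks in sync across the cluster
yum install -y ntp
chkconfig ntpd on
service ntpd start
```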

Push this file to the slave nodes and run it. This does NOT need to be done on the sandbox/head node.
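
With passwordless SSH in place, pushing and running the script across all slaves is two commands (node[1-4] is a placeholder host range):

```bash
# Copy the script to every slave, then execute it everywhere in parallel
pdcp -w node[1-4] prep-node.sh /root/prep-node.sh
pdsh -w node[1-4] "bash /root/prep-node.sh"
```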

3. Configure Services

Run the Ambari “add nodes” GUI installer to add data nodes. Be sure to select “manual registration” and follow the on-screen prompts to install components. I recommend installing everything on all nodes and simply turning the services off and on as needed. Also, installing the client binaries on all nodes helps make sure you can do debugging from any node in the cluster.

Ambari Add Nodes Dialog
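
For manual registration to work, each new node needs an Ambari agent installed and pointed at the head node before you walk through the wizard. Roughly (ambari.example.com stands in for your sandbox’s hostname, and installing ambari-agent assumes the Ambari repo is configured on the node):

```bash
# On each slave: install the agent and point it at the Ambari server
yum install -y ambari-agent
sed -i 's/hostname=localhost/hostname=ambari.example.com/' \
    /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent start
```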

4. Turn off select services as required. 

There should now be 1-N data nodes/slaves attached to your Ambari/Sandbox head node. Here are some suggested changes.  
1. Turn off large services you aren’t using, like HBase, Storm, and Falcon. This will help save RAM.  
Ambari Services
2. Decommission the DataNode on this machine! A head node is not a data node; if you run jobs here you will have problems.  
3. HDFS Replication factor – This is set to 1 in the sandbox because there is only one data node. If you only have 1-3 data nodes then triple replication doesn’t make sense; I suggest you stick with 1 until you get beyond 3 data nodes at a bare minimum. If you have the resources, just start with 10 data nodes (that’s why it’s called Big Data). If not, stick with a replication factor of 1, but be aware the system will function as a prototype and won’t provide the natural safeguards or parallelism of normal HDFS. A quick way to check and adjust replication on existing files is shown after this list.  
4. Increase RAM for the head node – At a bare minimum Ambari requires 4096 MB. If you plan to run the sandbox as a head node, consider increasing from this minimum. Also consider giving running services room to breathe by increasing the RAM allocated in Ambari for each service. Here is a great review and script for estimating how to scale services for MapReduce and YARN.  
5. NFS – To make your life easier, you might want to enable NFS on a data node or two.
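
One note on replication (item 3 above): dfs.replication is a client-side default applied at write time, so changing it in Ambari only affects newly written files. Here is a quick sketch for checking and re-replicating existing data (/tmp/somefile and /user are placeholder paths):

```bash
# Check the replication factor of an existing file (%r prints replication)
hadoop fs -stat %r /tmp/somefile

# Re-replicate existing data to match the new factor (here, 1)
hdfs dfs -setrep -w 1 /user
```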