DZone – The Evolution of MapReduce and Hadoop

Recently I authored a section of the DZone Guide for Big Data 2014. I wrote about MapReduce and the evolution of Hadoop. There is a ton of buzz in the market about speed to insight and the push toward alternative DAG engines like Spark. I see these new techs as exciting and awesome, and I welcome innovation and creativity in the Hadoop market. I also temper this excitement with a bit of reality: technology has a maturity curve. It comes out of the huge brain incubators like Google to the rest of us. Much like MapReduce, HDFS, and HBase, these technologies have a long road until they grow into a vibrant open source community with dedicated developers. Then comes a stage where the project may become its own incubator project within Apache or another open source foundation. Finally, you may see the tech evolve to the point that it's included in a supported Hadoop distribution. In their efforts to compete, vendors will clamor over one another to be the first to include hot tech brand X in their offering. I see this as healthy competition, but it could lead to the inclusion of technology that is never really removed from a distro; it may just fade away. Much like Red Hat's Linux distribution many years ago, Hadoop distributions are still relatively new and their choices are still small. Eventually, hopefully, we will see the number of projects grow to the point where vendors are actually refusing or removing projects that are no longer relevant. I'm not sure we are there yet.

All that said, do I think MapReduce is going to dry up and blow away? No, I don't. Do I think Spark and alternative engines will gain steam as they prove their efficiencies? Yes, I do. I think most folks need to hear more than one group validate scalability. I'm sure Google has tooling that works for them. Are they sharing it, or better yet, has it matured to the point that we mere mortals can use it?
I think that remains to be seen. MapReduce, and furthermore Hive, which uses MapReduce, has a HUGE ecosystem of tools that depends on it. Also take a look at Stinger Next to see some exciting new developments in Hive. Take a walk down the rows of vendor tables at the next Hadoop Summit or Strata and ask them in detail which Hadoop tool their tool depends on. Most likely you will eventually get to an answer that is basically Hive (or HCatalog). Hive and MapReduce will be here for years to come, if for no other reason than market penetration. Anyway, take a look at the new DZone Guide. It has many good topics that may interest you.

Implementing the Tool Interface in MapReduce

I was banging around with MapReduce the other day while web surfing, and I came across this post on implementing the Tool interface in your MapReduce driver program. Most first-level examples show a static main method which, as the author describes, doesn't allow you to use the configuration dynamically (i.e., you cannot use -D at the command line to pass options to the configuration object). For fun I took WordCount and refactored it using this suggestion. I thought it might be good to share this with folks, so I have posted the full code to GitHub.

Using this method you can now pass options to the configuration object via the command line using -D. This is a handy addition to any MapReduce program.
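As an illustration of what this enables (the jar name, class name, and paths here are hypothetical), a driver that runs through ToolRunner has its generic options parsed before run() is called, so -D overrides reach the job Configuration at submit time:

```shell
# Any -D name=value pairs are consumed by the generic options parser and
# applied to the Configuration before the job runs. The property shown is
# the Hadoop 2 name; on Hadoop 1 the equivalent is mapred.reduce.tasks.
hadoop jar wordcount.jar WordCount \
    -D mapreduce.job.reduces=4 \
    /user/me/input /user/me/output
```

Without the Tool interface, those -D arguments would land in args[] untouched and the job would ignore them.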

Creating a Multinode Hadoop Sandbox

One of the great things about all the Hadoop vendors is that they have made it very easy for people to obtain and start using their technology rapidly. I will say that I think Cloudera has done the best job by providing cloud-based access to their distribution via Cloudera Live. All vendors seem to have VMware- and VirtualBox-based sandbox/trial images. Having worked at Hortonworks, their sandbox is the one I have the most experience with, so I thought a quick initial blog post would be helpful.

While one could simply do this installation from scratch following the package instructions, it's also possible to short-circuit much of the setup and take advantage of the scaled-down configuration work already put into the virtual machine provided by Hortonworks. In short, the idea is to use the VM as a single master node and simply add data nodes to it. Running this way provides an easy way to install and expand an initial Hadoop system up to about 10 nodes. As the system grows you will need to add RAM, not only to the virtual host but to the Hadoop daemons as they scale. A full script is available here. Below is a description of the process.

The general steps include:

1. The Sandbox 

Download and install the Hortonworks Sandbox as your head node in your virtualization system of choice. The sandbox tends to be produced prior to the latest major release (compare the output of yum list hadoop\*). Make sure you have first enabled Ambari by running the script in root's home directory, then reboot.

To make sure you are using the very latest stable release and that the Ambari server and agent daemons have matching versions, upgrading both is easiest.
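A hypothetical sketch of that upgrade on the sandbox VM, based on the standard Ambari procedure (package names and any repo file updates vary by release, so check the release notes for your version):

```shell
# Stop the server, pull the newer packages, migrate, and restart.
ambari-server stop
yum clean all
yum upgrade -y ambari-server ambari-agent
ambari-server upgrade     # migrates the server database/schema
ambari-server start
ambari-agent restart
```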

2. The Nodes  

Install 1-N CentOS 6.5 nodes as slaves and prep them as worker nodes. These can be default installs of the OS but need to be on the same network as the Ambari server. The prep can be facilitated via pdsh (which requires passwordless SSH), or better yet by creating one “data node” image via a PXE boot environment or a snapshot of the virtual machine to quickly replicate 1-N nodes with these changes.

If you want to use SSH you can do this from the head node to quickly enable passwordless SSH:
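A minimal sketch of one way to do it (worker host names are passed as arguments, and root SSH access to the workers is assumed):

```shell
# Enable passwordless SSH from the head node to the workers.
# Usage: ./enable-passwordless-ssh.sh node1 node2 ...
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

# Generate a key pair once, without a passphrase, if none exists yet.
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"

# Push the public key to each worker so pdsh/ssh won't prompt.
for host in "$@"; do
    ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "root@$host"
done
```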

You then want to make the following changes on your slave nodes. Again, this could easily be done via pdsh, by using pdcp to push a script with the following content to each node and then executing it.
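The exact contents depend on your environment; this sketch writes a worker prep script containing a hypothetical but typical set of Ambari prerequisites (firewall off, SELinux off, clocks in sync):

```shell
# Write the worker prep script locally so it can be pushed out with pdcp.
cat > prep-worker.sh <<'EOF'
#!/bin/bash
# Ambari agent registration generally requires iptables and SELinux off.
service iptables stop
chkconfig iptables off
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# Keep cluster clocks in sync.
yum install -y ntp
chkconfig ntpd on
service ntpd start
EOF
chmod +x prep-worker.sh
```

Something like `pdcp -w node[1-9] prep-worker.sh /root/` followed by `pdsh -w node[1-9] /root/prep-worker.sh` would then fan it out.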

Push this file to slave nodes and run it. This does NOT need to be done on the sandbox/headnode.

3. Configure Services

Run the Ambari “add nodes” GUI installer to add the data nodes. Be sure to select “manual registration” and follow the on-screen prompts to install components. I recommend installing everything on all nodes and simply turning the services off and on as needed. Installing the client binaries on all nodes also helps ensure you can debug from any node in the cluster.

Ambari Add Nodes Dialog

4. Turn off select services as required. 

There should now be 1-N data nodes/slaves attached to your Ambari/Sandbox head node. Here are some suggested changes.  
1. Turn off large services you aren’t using like HBase, Storm, Falcon. This will help save RAM.  
Ambari Services
2. Decommission the DataNode on this machine! No, a head node is not a data node. If you run jobs here you will have problems.  
3. HDFS replication factor – This is set to 1 in the sandbox because there is only one DataNode. If you only have 1-3 data nodes then triple replication doesn't make sense. I suggest you use 1 until you get over 3 data nodes at a bare minimum. If you have the resources, just start with 10 data nodes (that's why it's called Big Data). If not, stick with a replication factor of 1, but be aware this will function as a prototype system and won't provide the natural safeguards or parallelism of normal HDFS.  
4. Increase RAM on the head node – At a bare minimum Ambari requires 4096 MB. If you plan to run the sandbox as a head node, consider increasing beyond this minimum. Also consider giving running services room to breathe by increasing the RAM allocated in Ambari for each service. Here is a great review and script for estimating how to scale services for MapReduce and YARN.  
5. NFS – to make your life easier you might want to enable NFS on a data node or two.
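For reference, the replication change maps to a single HDFS property; in hdfs-site.xml it looks like the fragment below (change it through Ambari's HDFS configs rather than editing the file by hand, so the value stays managed):

```
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```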

Creating a Virtualized Hadoop Lab

Over the past few years I have let my home lab dwindle a little. I have been very busy at work, and for the most part I was able to prototype what I needed on my laptop, given the generous amount of RAM in the MacBook Pro. That said, I was still not able to have the type of permanent setup I really wanted. I know lots of guys who go to the trouble of setting up racks to create their own clusters at home. Given that I really only need a functional lab environment and don't want to waste the power, cooling, or space in my home, I turned to virtualization. While I would be the first one in the room to start babbling about how Hadoop is not supposed to be virtualized in production, it is appropriate for development. I wanted a place to test and use a variety of Hadoop virtual machines:

Vendor – Distribution
Hortonworks – Hortonworks Data Platform
Cloudera – Quick Start VMs
MapR – MapR Distribution for Apache™ Hadoop®
Oracle – Big Data Lite
IBM – Big Insights


* If you are feeling froggy, here is a full list of Hadoop vendors.

So I dusted off an old workstation I had in my attic, a Dell Precision T3400 that I used a few moons ago for the same reason. A couple of years ago this system was fine for running a handful of minimal Linux instances; to make it useful today it needed some work. I obviously had to upgrade Ubuntu to 14.04, as it was still on some version in the 12 range. I won't bother with the details of those gymnastics, as I believe the Ubuntu community has this covered.

While I did take a look at VirtualBox and VMware Player, I wanted to use something open source and sans GUI. I realize there are command line options for both VirtualBox and VMware, but in the end QEMU/KVM with libvirt fit the bill as the most open source and command-line way to go. For those new to virtualization and in need of a GUI, one of the other solutions might be a better fit. It's left as an exercise for the reader to get QEMU and libvirt installed on your OS. An important point I worked through was creating a bridged adapter on the host machine. I only have one network card installed and wanted my hosted machines on my internal network. In short, you are creating a network adapter that the virtualization system can use on top of a single physical adapter. The server can still use the regular IP of the original adapter, but the virtual hosts can now act as if they are fully on the local network. Since this system won't be leaving my lab, this is a perfect solution. If you want something mobile, such as on a laptop, you should consider an internal or host-only network setup. Make sure you reboot after changing the following.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto br0
iface br0 inet static
bridge_ports eth0
bridge_fd 9
bridge_hello 2
bridge_maxage 12
bridge_stp off

Although QEMU supports a variety of disk formats natively, I decided to convert the images I collected for my Hadoop playground into qcow2, the native format for QEMU. I collected a variety of “sandbox” images from a number of Hadoop and Big Data vendors. Most come in OVA format, which is really just a tarball of the vmdk disk image and an ovf file describing it. To convert, you simply extract the vmdk file:

tar -xvf /path/to/file/Hortonworks_Sandbox_2.1_vmware.ova

and convert the resulting vmdk file:

qemu-img convert -O qcow2 Hortonworks_Sandbox_2.1-disk1.vmdk /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2

Have more than one vmdk, like the MapR sandbox? No problem:

qemu-img convert -O qcow2 ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk1.vmdk ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk2.vmdk ../MapR-Sandbox-For-Hadoop-3.1.1_VM.qcow2


The use of the images is quick and easy:

virt-install --connect qemu:///system -n HWXSandbox21 -r 2048 --os-type=linux --os-variant=rhel6 --disk path=/home/user/virtual_machines/Hortonworks_Sandbox_2.1-disk1-VMware.qcow2,device=disk,bus=virtio,format=qcow2 --vcpus=2 --vnc --noautoconsole --import

If you go the GUI route, you could use virt-manager at this point to get a console and manage the machine. Of course, in the interest of saving RAM and out of pure stubbornness, I use the command line. First find the list of installed systems, then open a console to that instance.

virsh list --all
virsh console guestvmname


While this will get you a console, you might not see all the console output you would see on a monitor. For CentOS you need to create a serial interface in the OS (ttyS0) and instruct the OS to use it. From this point one should be able to log in, find the IP address, and be off to the races. With the new serial interface in use, you will also see the normal boot-up output if you reboot.
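A minimal sketch of those CentOS 6 changes (device name and baud rate are the common defaults; the kernel line shown is from a stock install and will differ on yours):

```
# /boot/grub/grub.conf -- append to the end of the existing kernel line:
#   console=tty0 console=ttyS0,115200

# /etc/securetty -- add the serial device so root can log in on it:
ttyS0
```

With the console= argument in place, CentOS 6 will spawn a getty on the serial line at boot, which is what virsh console attaches to.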

The real savings here is memory. Turning off the X server and all unnecessary OS services saves memory for running the various sandboxes. This should allow you to use Linux on a machine with 8 to 16 GB of RAM effectively for development.

The next step will be to automate the installation of the base operating systems via a PXE boot environment, followed by installation of a true multinode virtualized Hadoop cluster. That I will leave for another post.

Updating Ambari Stack Repos via REST API

If you use Ambari to deploy Hadoop, you may have had occasion to change the repo used after you installed. At the time of this article, version 1.5.1 of Ambari requires you to do this via the API, as it is not exposed in Ambari Web. The HDP repo file placed in /etc/yum.repos.d is generated by Ambari under certain conditions. In any case, this is a good review of how to use the REST interface to manipulate Ambari. The basic approach uses the curl command to GET and PUT items via the Ambari API layer. Want some basic information about your cluster? Try this:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/clusters


You can check your existing repo name via this command:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6


Then you can set the repo to whatever you need via this PUT command. This again helps in situations where perhaps the internal repo name has changed post-install. After the call you can double-check your work by rerunning the above GET command, or via the Ambari web interface under Admin > Clusters > Repositories.

curl -H "X-Requested-By: ambari" -X PUT -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6 -d '{"Repositories": {"base_url": "http://REPOHOSTNAMEHERE/HDP/centos6/2.x/updates/", "verify_base_url": false}}'

This works for all the different repos for each stack. You can play with the GET command above to explore the different options available for each installed stack.

Look Ma… I'm Famous, I'm in Linux Journal.

But seriously, it is kind of cool that I finally made it into Linux Journal. It has been one of my longtime reads and subscriptions, and I have learned lots of cool tech from what I consider the original Linux magazine. Anywho, take a look at my article in the latest edition of Linux Journal (the HPC edition, too!). It's about YARN and understanding how HPC scheduling systems differ from it.


Hadoop Audit Logging

A recent interesting list of customer questions included a query about audit logging in Hadoop, specifically the logging of actions in Hive such as create table actions and queries. Audit logging was added to Hadoop some time ago. Several JIRAs addressed it, including HIVE-3505 and HIVE-1948, but it wasn't really addressed in terms of documentation until recently, via HIVE-5988.

Detailed logging is enabled via the log4j settings in /etc/hadoop/conf/. Changing the audit logger from WARN to INFO enables the logging.


In Hortonworks Data Platform this occurs around line 106 in the provided log4j file, and the output places notes around actions that happen in HDFS. Since Hive also uses HDFS, its actions are logged in /var/log/hadoop/hdfs/hdfs-audit.log.
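The stanza in question looks something like the following (exact line numbers and appender names vary between releases); the change being described is flipping the audit logger's level from WARN to INFO:

```
# hdfs audit logging
hdfs.audit.logger=INFO,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
```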


Actions such as create table, show database, etc. are listed for audit logging. Parsing this log can yield detailed reports on how data was accessed in Hadoop by technologies that rely on HDFS.

Yarn Scheduler Load Simulator


I worked for years in HPC, specifically in workload management on tools like PBS Professional and Platform LSF. Most of the time, when we wanted to understand what was happening with an HPC scheduling system that had more options than was humanly possible to comprehend, we tested on live systems, tuning options and making changes until the desired behavior was achieved. It was a little more art than science at times. No matter the platform, one thing I always wanted was a scheduling simulator to allow me to scientifically optimize the scheduler's behavior.


If you are new to Hadoop you may not have heard of YARN. This is part of Hadoop 2.0 and represents the next generation of Hadoop processing, including new daemons, scheduling based upon resources (no more slots), and the possibility of executing more programming paradigms than just MapReduce (e.g., MPI, Storm, etc.). I think most people have recognized that while MapReduce is great for certain things, there is probably also room at the party for a few other guests.

There is one JIRA specifically that I have been watching that has me all worked up. Yes, it's a scheduling simulator. The proposed feature would offer such novel delights as detailed costs per scheduler operation, allowing tuning of the entire cluster or of individual queues. From my work history I know how hotly debated scheduling policy becomes at companies. It typically becomes a struggle between competing groups jockeying for resources for their projects. This one JIRA would move that conversation from a political battle to a discussion of simulation results, run to provide the information needed to make a decision.


This will probably be one of the first things I play with in Hadoop 2.0 very soon. Below is a nice video showing the current state of the project. This has me excited, to say the least. I have many other ideas to help bring the knowledge of HPC scheduling to Hadoop where applicable, but that will have to be another blog post.

Hue Job Designs

Since I am a command line zealot, I wanted to understand how to submit Hadoop jobs via Hue in the Hortonworks Sandbox. There are already a ton of tutorials included with the Sandbox focused on the use of Hive and other tools for some of the most popular use cases, like sentiment analysis. Moving beyond those examples, let's say I have my own jar files. Then what? Luckily there is a Job Designer icon that allows you to submit a whole range of custom jobs and have them either run immediately or be scheduled later via Oozie. Just so I could understand the mechanics, I simply jarred up WordCount; this is walked through here. In case you weren't aware, the jar $HADOOP_HOME/hadoop-examples.jar contains a number of examples for you to use. Provided you have some working code, you should be able to quickly cobble together a job design in Hue that submits that code to Hadoop.
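For example, before wiring a jar into a Hue job design you can sanity-check it from the shell (the HDFS paths here are hypothetical, and the output directory must not already exist):

```shell
# Run the bundled WordCount example from the examples jar against HDFS.
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /user/hue/input /user/hue/output
```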



There is quite a collection of possibilities here, including email, ssh, shell scripts, and more. If you haven't explored Job Designs in Hue, it's worth your time. Previously Hue was only available with the Hortonworks Sandbox, but as of HDP version 1.3.2 it is available for fully distributed clusters, where Hue can be installed manually.

UPDATE: I was reminded that Hue has been available as a tarball for some time AND has been not only in the Cloudera distro but also a part of Apache Bigtop. This was a simple omission on my part, as I usually write articles on my blog from an HDP perspective since I use it daily. Apologies to all those folks doing great work on Hue. I know from working in this field that people love it.

Connecting Talend to Hadoop


If you have not used Talend, it's a great tool for doing ETL visually. The drag-and-drop WYSIWYG interface (which is basically Eclipse) allows you to program in Java without knowing you are programming (see the “code” tab for your job). I have played a lot with Talend. Since there is a free version for Big Data, it's easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once running, you will probably want to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.

While simply dragging components to the designer pane of the job (the center of the screen) and editing each component's properties will work, the concept of a “context” in Talend is very useful.


You can take a look at the help file on contexts within Talend, but I don't think there is a Big Data-specific version, so I thought I would make some notes on this. On the far left, in the repository view, you will see a submenu called Contexts that you can control-click (or right-click) to add a context. After naming it, you can add what amounts to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.


This includes the Job Tracker, NameNode, Templeton, HBase, Hive, Thrift, and user name, as well as a data directory and a jars directory. You will need to copy a number of jars from Hadoop to your local system and register them with Talend. For the Talend version I am using (Open Studio for Big Data 5.3.1r104014) I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore, and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various components' properties as you use them. Your list may be slightly different, but the point is to simply find a local place to put the jars and then reference that location using a Talend context rather than constantly using a full path. You will also need to make sure you import your context into your job. This amounts to clicking on the “Contexts” tab in the job properties and clicking on the clipboard to import the contexts.


To use the context in a component, I have captured a screenshot showing the HDFS component using standard quote-and-append syntax to access the global variable:

"hdfs://" + context.NN_HOST + ":8020/"


Contexts are a really useful trick for making the components of your job portable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to a new cluster by simply altering the context.

YARN is coming, and it will change the daemons used in Hadoop and naturally the context information, but again, you can have more than one context as well as more than one cluster. Contexts are a great way to rapidly abstract cluster-specific details out of your Talend jobs. I will follow up soon with a YARN-based post connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop; I will likely present a selection of them in subsequent blog posts.