Block placement and Multi-tenancy – Where is your data?

Most folks tend to think of the Hadoop storage layer as a large hard drive. At a high level I guess this is a fair assumption. The real issue comes to light when one considers actual block placement in Hadoop. Many architects want to design systems for multitenancy using Hadoop as a core part of their design. An HDFS system has a very particular strategy for block placement. Within a single cluster blocks cannot be restricted to a set of hosts (using a default build of HDFS) or even a set of drives within a host. This means that even though users might feel that both HDFS POSIX ACLs level permissions protect data from unauthorized access this only applies to folks using the front door. As they say locks are for honest people. Only users attempting to access your data via Hadoop based methods will be blocked. The unscrupulous could still for instance directly access nodes and therefore the blocks placed in the Linux file system accessed by Hadoop. Hadoop itself leverages the Linux file system. Encryption of this data at rest is a partial answer to this dilemma but this is a very new feature of Hadoop 3 not yet adopted by most distribution vendors. The classical answer has been to engage a third party vendors for a more robust answer to the issue of Hadoop encryption at rest. At the end of the day the block could still be access (isolating direct log in to data node helps) and theoretically decrypted. Im sure this could easily be the topic or a least a sub plot of spy movie.

The other alternative is to simply NOT use HDFS. The MapR FS for example has none of these issues because it essentially is a true file system. Block are not the unit of replication, the Namenode meta data is not an issue as its distributed and one cannot access the MapR FS nor any of its components in any way other than the front door (via the MapR client). Along with the ability to place data not only on specific nodes but also on specific drives within a node MapR FS is really a more elegant solution. For these reasons true data isolation is really guaranteed with MapR FS. With HDFS the correct answer to guarantee data isolation is really an entirely alternate HDFS subsystem (another cluster). HDFS has the concept of Namespaces but that really addresses the small files issue and isnt an enforcement method for data placement on nodes or drives. The comparison on a features by feature basis tends to leave HDFS wanting.

The lesson here is to really study what is included and how components are configured prior to engaging in the use of Hadoop for a Multi-tenant infrastructure.

Connecting to MapR via JDBC

I recently needed to do some connection testing via JDBC to Hiveserver 2 in MapR 4.0.2 in a secure and kerberized cluster. Since this was new to me I thought it would be worthwhile to write-up my notes as I will sure forget half of this in less than a few weeks. Just in case you are needing to connect via JDBC to a Kerberized cluster here are some tips.

MapR has its own login system called maprlogin for use in a secure cluster setup that requires authentication via password or Kerberos to use the cluster.

Once completed users can act on the cluster via the command line normally via the hadoop commands (or through the Linux command line in NFS mounted MapR-FS).

For JDBC connections users must use the principal configured by the Kerberos and Hadoop Admins for the hive service (not their user principal – that is used in the previous step). See the setting “hive.server2.authentication.kerberos.principal” along with the matching keytab specified in “hive.server2.authentication.kerberos.keytab” which in MapR is typically under /opt/mapr/conf/mapr.keytab.

As a quick refresher here is what we are dealing with in terms of a Kerberos principal for a JDBC connection string:


HS2.FQDN – Hive server 2 fully qualified Domain name and port – Default is 10000

primary – username – user must exist on Hiveserver2 node. You will normally see a user principal that matches the service name – i.e., hive, yarn etc.

Instance – use a FQDN that is resolvable both forward and reverse. This is a property of Apache Hive not specific to the MapR distribution (Hive version 0.13 was used in this testing against MapR 4.0.2). Hive attempts to resolve the hostname (in green above) when connecting via Kerberos.

Realm – correct Kerberos Realm name – see your local /etc/krb5.conf or consult your Kerberos Administrator.

And here are some examples of how to connect. Beeline is a Hive service you can use that is built into Hive. There are many tools that use a similar methods of connection that can be used.

You may need to either call the beeline client with a fake user name and password or hit enter twice interactively. This is a known bug.

Java Client



HDFS ACLs – Managing Data from the ground up.

HDFS and the permissions placed upon it are the first line of authorization in Hadoop. Every other service that relies upon HDFS in Hadoop must adhere to the permissions enforced by HDFS. This is the place to start when concerned about the security of your Hadoop system. One simply needs to apply the principle of least privilege just like with everything else in IT. Many people fret about Kerberos for authorization and encryption of data at rest and in motion, all of which are worthy discussions but ignore the basic principles of the file system (permissions and quotas) as one of the easiest way to both restrict and selectively share information.

A cool new feature in HDFS inclusion of  ACLs.The full documentation is provided here but I always like to write the guide for the impatient aka quick start articles. I wont visit every detail of this feature. Using classic users and groups from Linux (POSIX) you could restrict access to files and directories in HDFS using a pretty standard permissions model with a few exceptions like the execute bit (no executable files in HDFS) and setuid/setgid but the sticky bit is still there.


Once defined, two users and two unique groups in Linux (do this on the Namenode – this is where the Namenode picks up group membership) a quick investigation of the ACLs could be  launched. Many people ask “How do I enable complex access multiple groups and users?” HDFS was recently extended to support POSIX to support more complex use cases. So when user Bob attempts to write to user Joe’s home directory access is denied. Notice the plus sign now indicating some extended ACLs have been set on a directory.

Although you can see the actual ACLs in the permission denied error a slightly closer look at the extended ACLs shows the reason more clearly:

That directory is owned by user Joe and even though Joe is in poc_group2 that group does not have write permissions. That group can list the contents of the Joe’s home but unless you are in poc_group1 with full access or poc_group2 with read access you can see anything in Joe’s home (group other is set to no access). Another cool thing to notice is use of the mask setting. One could also set mask that overrides the other settings such as:

In this case you end up with an effective setting of read only since the mask now overrides the permissions of the ACL. The other interesting point is that the “default” settings have to do with the behavior of the child objects.

So what? What does all this have to do with anything. Well for starters it’s a great way to exert basic control over your HDFS layout. If you want to stop folks from simply filling up your file system with junk this is great way to stop that (along with Quotas). Lots of folks in the market talk about adopting Hadoop as a challenge due to lack of control over the environment. Hadoop permission and ACLs are the lowest levels of control one has over what is placed in Hadoop and probably one of the least well understood.


Hadoop Encryption at Rest

At Rest, as in not motion, not REST as in web services. For a long time the only real answer Hadoop had for encryption at rest was to leverage a third-party tool or consider the use of LUKS for whole disk encryption. What I see customers asking for these days is really encryption in motion (aka wire encryption), encryption at rest (at a HDFS layer) plus policies that will eliminate data from specific directories in HDFS based upon some business rule. The good news is that as of Hadoop 2.6 we now have HDFS-6134 in play so there is light at the end of the security tunnel.

The implementation of this new transparent encryption is supported via the normal Hadoop Filesystem Java API, the libhdfs C API and WebHDFS (REST) API. The great news is that once it is set up normal HDFS ACL control access to reading and writing so while there is some administration upfront from a user perspective there is not a terribly large new burden. This essentially means that third-party integration work should be largely left intact.

There is now a Key Management Server (KMS) used to create keys for the encryption process of “encryption zones” also know as directories in HDFS.

So how does all this happen? The design doc describes both the read and write action processes. Illustrated here is the read process:


So how does one functionally use encryption zones. Cloudera has a great docs page talking about how to create encryption zones based upon the technology used (over hdfs).

Like most newly invented technology the design doc also calls out some potential issues with the design. While there are some potential vulnerabilities called out in the spec I would still say this is a massive step in the right direction. I also noted this first version really only uses AES-CTR.

It might seem like this is small matter but in the larger context of a security discussion native encryption at rest is an important part of the Hadoop puzzle.


Hadoop and the mystery of the version number

When I’m working with people on Hadoop I ask what you would think is a simple question. What version of Hadoop are you using? The answer normally is one of several attempts to explain what’s installed including –


Answer Translation
Hortonworks/Cloudera This is my Hadoop Distribution.
Hortonworks 2 I know we aren’t using version 1.
Hadoop 2 I dont know my distro but I’m using Hadoop 2.
Apache someone else is working this. I have no idea.

In reality though it’s not as straight forward as you might think. I think the easiest way to get the most bang for your buck is to simply take a look at the version number of the package installed. So on yum based systems you could simply do

and get back of list of whats installed and whats available. You could also simply query the rpm database:

If you run SLES you will need to do zypper and on windows look at your add/remove programs dialog on most major newer versions of windows. In the end you are still left with this cryptic string to decode. If you look closely there is a method to the madness and it helps to know this level of detail when working in an area like Hadoop where minor version numbers or a build number could make all the difference.

For example:
package nameversionarchitecture

The version number in this case is from a Hortonworks distribution so  we have a seven digit (8 places) version number.

package versionHDP Versionbuild number 385

It’s important to know both the version of Hadoop and the version of the package you are working on. For example if someone says “I’m working on Hive”. You really need to know what hive version AND what Hadoop version because the two are intimately linked. If someone gives you the hive package string:

It’s really not enough information for you to tell what version of Hadoop someone is using. You know they are using HDP so one either asks for the same information on the Hadoop package installed OR goes to the release notes for the distro to decode the distribution version number into the Apache Hadoop version. Each distribution uses a different combination of packages and it pays to know EXACTLY what you are getting when you download a distro. Cloudera has exactly the same issues and their packaging may in fact be even more forthcoming in that they tell you how many patches were applied. Hortonworks does this in the context of their release notes.

package namepackage version+CDH version+patches


Hopefully now you have a better understanding of Hadoop package versions.


Hive with JSON data

I stumbled across this and thought it would be helpful to write this up to save everyone else some time. So I went to use JSON with Hive 13 for what I thought was a pretty simple use case of creating a table with JSON data. I was looking for the right SerDe and stumbled across this blog entry stating that we should use the code from this github repo to make a jar that works with Hive 0.13. So here we go…

Sigh…so after some searching I stumbled across another few blog posts and finally a github repo fork that I cloned and built to create a jar that works with Hive 13 and Hadoop 2.4.

Ahhhh. So much better. I am using the latest HDP 2.1 sandbox for writing code so my packages are:

I will create another blog post (and link it here) to explain the version numbers of the packages in HDP.

Many Thanks to KunBetter who saved the day for us in our work at a recent customer.


This saved us many hours of aggravation. Open Source works. Give it a try. Someday someone on the other side of the planet may have the answer you need.


Implementing Tools Interface in MapReduce

I was banging around with MapReduce the other day and web surfing. I came across this post on implementing the Tools interface in your MapReduce driver program. Most of the first level examples show a static main method which as the author describes doesn’t allow you to use the configuration dynamically (i.e., you cannot use -D at the command line to pass options to the configuration object). For fun I took Word Count and refactored it using this suggestion. I thought it might be good to share this with folks. I have posted the full code to github and display it below as well.

Using this method you can now pass options to the configuration option via the command line using -D. This is a handy addition to any MapReduce program.

Creating a Multinode Hadoop Sandbox

One of the great things about all the Hadoop vendors is that they have made it very easy for people to obtain and start using their technology rapidly. I will say that I think Cloudera has done the best job by providing cloud based access to their distribution via Cloudera Live. All vendors seems to have a VMware and VirtualBox based sandbox/trial image. Having worked with Hortonworks I have the most experience with and thought a quick initial blog post would be helpful.

While one could simply do this installation from scratch following the package instructions, it’s also possible to short-circuit much of the setup as well as take advantage of the scaled down configuration work already put into the virtual machine provided by Hortonworks. In short the idea would be to use the VM as a single master node and simply add data nodes to this master. Running this way provides and easy way to install and expand an initial Hadoop system up to about 10 nodes. As the system grows you will need to add RAM to not only the virtual host but to Hadoop Daemons as it scales. A full script is available here. Below is a description of the process.

The general steps include:

1. The Sandbox 

Download and install the Hortonworks Sandbox as your head node in your virtualization system of choice. The sandbox tends to be produced prior to the latest major release (compare yum list hadoop *\ output). Make sure you have first enabled Ambari by running the script in root’s home directory and reboot.

In order to make sure you are using the very latest stable release and that the Ambari server and agent daemons have matching versions upgrading is easiest. This includes following:

2. The Nodes  

Install 1-N Centos 6.5 nodes as slaves and prep them as worker nodes. These can be default installs of the OS but need to be on the same network as the Ambari server. This can also be facilitated via pdsh (but this requires passwordless ssh) OR better yet simply creating one “data node” image via a PXE boot environment or snapshot of the Virtual machine to quickly replicate 1-N nodes with these changes.

If you want to use SSH you can do this from the head node to quickly enable passwordless SSH:

You then want to make sure you make the following changes to your slave nodes. Again this could easily be done via pdsh by pcdp the a script to each node and executing with the following content.

Push this file to slave nodes and run it. This does NOT need to be done on the sandbox/headnode.

3. Configure Services Run the Ambari “add nodes” GUI installer to add data nodes. Be sure to select “manual registration” and follow the on-screen prompts to install components. I recommend installing everything on all nodes and simply turning the services off and on as needed. Also installing the client binaries on all nodes helps to make sure you can do debugging from any node in the cluster.

Ambari Add Nodes Dialog

4. Turn off select services as required. 

There should now be 1-N data nodes/slaves attached to your Ambari/Sandbox head node. Here are some suggested changes.  
1. Turn off large services you aren’t using like HBase, Storm, Falcon. This will help save RAM.  
Ambari Services
2. Decommission the Data node on this machine! No! a head node is not a datanode. If you run jobs here you will have problems.  
3. HDFS Replication factor – This is set to 1 in the sandbox because there is only one datanode. If you only have 1-3 datanodes then triple replication doesn’t make sense. I suggest you use 1 until you get over 3 data nodes at a bare minimum. If you have the resources just start with 10 data nodes (that’s why it’s called Big Data). If not stick with replication factor of 1 but be aware this will function as a prototype system and wont provide the natural safeguards or parallelism of normal HDFS.  
4. Increase RAM to Head node – At a bare minimum Ambari requires 4096MB. If you plan to run the sandbox as a head node consider increasing from this minimum. Also consider giving running services room to breath by increasing the RAM allocated in Ambari for each service. Here is a great review and script for guestimating how to scale services for MapReduce and Yarn.  
5. NFS – to make your life easier you might want to enable NFS on a data node or two.

Creating a Virtualized Hadoop Lab

Over the past few years I have let my home lab dwindle a little. I have been very busy at work and for the most part I was able to prototype what I needed for work on my laptop given the generous amount of RAM on the MacBook Pro. That said I was still not able to have the type of permanent setup I really wanted. I know lots of guys who go to the trouble of setting up racks to create their own clusters at home. Given that I really only need a functional lab environment and don’t want to waste the power, cooling or space in my home I turned to virtualization. While I would be the first one in the room to start babbling on about how Hadoop is not supposed to be virtualized in production it is appropriate for development. I wanted a place to test and use a variety of Hadoop virtual machines:

Vendor Distro URL
Hortonworks Hortonworks Data Platform
Cloudera Quick Start VMs
MapR MapR Distribution for Apache™ Hadoop®
Oracle Big Data Lite link
IBM Big Insights


* If you are feeling froggy here is a full list of Hadoop Vendors.

So I dusted off an old workstation I had in my attic from a couple of years ago. This is a Dell Precision T3400 workstation that I used a few moons ago for the same reason. A couple of years ago to run a handful of minimal Linux instances this system was fine. To make is useful today it needed some work. I obviously had to upgrade the Ubuntu to 14.04 as it was still some version in the 12 range. I wont bother with the details of these gymnastics as I believe the Ubuntu community has this covered.

While I did take a look at VirtualBox and VMware Player I think I wanted to use something open source but also sans GUI. I realize there are command line options for both VirtualBox and VMware but in the end using QEMU/ KVM with libvirt fit the bill as the most open source and command line way to go. For those new to virtualization and in need of a GUI one of the other solutions might be a better fit for you. Its left as an exercise for the reader to get QEMU and libvirt installed on your OS. An important point I worked through was creating a bridged adapter on the host machine. I only have one installed network card and wanted my hosted machines on my internal network. In short you are creating a network adapter that the virtualization system can use on top of a single physical adapter. The server can still use the regular IP of the original adapter but now virtual host can act as if they are fully on the local network. Since this system wont be leaving my lab this a perfect solution. If you want something mobile on your laptop such as you should consider an internal or host only network setup. Make sure you reboot after changing the following.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto br0
iface br0 inet static
bridge_ports eth0
bridge_fd 9
bridge_hello 2
bridge_maxage 12
bridge_stp off

Although QEMU supports a variety of disk formats natively I decided to convert the images I collected for my Hadoop play ground into qcow2 the native format for QEMU. I collected a variety of “sandbox” images from a number of Hadoop and Big Data vendors. Most come in OVA format which is really just a tarball of the vmdk file and ovf file describing the disk image. To convert you simply extract the vmdk file: 

tar -xvf /path/to/file/Hortonworks_Sandbox_2.1_vmware.ova

and convert the resulting vmdk file:

qemu-img convert -O qcow2 Hortonworks_Sandbox_2.1-disk1.vmdk /path/to/Hortonworks_Sandbox_2.1-disk1.qcow2

Have more than 1 vmdk like the MapR sandbox? No problem: 

qemu-img convert -O qcow2 ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk1.vmdk ./MapR-Sandbox-For-Hadoop-3.1.1_VM-disk2.vmdk ../MapR-Sandbox-For-Hadoop-3.1.1_VM.qcow2


The use of the images is quick and easy:

virt-install --connect qemu:///system --ram 1024 -n HWXSandbox21 -r 2048 --os-type=linux --os-variant=rhel6 --disk path=/home/user/virtual_machines/Hortonworks_Sandbox_2.1-disk1-VMware.qcow2,device=disk,bus=virtio,format=qcow2 --vcpus=2 --vnc --noautoconsole --import

If you were to go the GUI route one could use virt-manager at this point to get a console and manage the machine. Of course in the interest of saving RAM and pure stubbornness I use the command line. First find the list of installed systems and then open a console to that instance. 

virsh list --all
virsh console guestvmname


While this will get you a console, you might not see all the console output you want to when using a monitor. For CentOS you need to create a serial interface in the OS (ttyS0) and instruct the OS to use that new interface. From this point you one should able to log in, find the IP address and be off to the races. With the use of the new serial interface you will see the normal boot up action if you reboot.

The real saving here is memory. Turning of Xserver and all the unnecessary OS services saves memory for running the various sandboxes. This should allow you to use Linux and a machine with 8 to 16GB of RAM effectively for development.

The next step will be to automate the installation base operating systems via PXE boot environment followed by installation of a true multinode virtualized Hadoop cluster. That I will leave for another post.