Single Node Ceph Cluster Sandbox

Well, I had not heard of the Ceph project until recently, when I was asked about making something work with the Ceph S3 API. After looking over a few different sites, I was able to proceed in creating a single node version of what was supposed to be a 6 node setup. I wanted to be sure to share the code for this.

After the preparation steps I was into the actual installation of Ceph. I found a few packages were missing: both things that I wanted and a few hard requirements.

Now let's get down to business.

Probably one of the most important parts was the editing of ceph.conf as described here. Another gotcha: I missed the "1" in the "ceph-deploy osd activate" command, so look closely. To use the S3 endpoint you have to set up the Ceph gateway (the RADOS gateway server).
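For the record, here is a hedged sketch of the single-node tweaks I mean; the hostname node1 and the OSD directory are placeholders from my notes, so check them against your own layout:

```shell
# On a single node Ceph will never reach active+clean with the default
# replication of 3, so drop it to 1 and let CRUSH pick OSDs, not hosts:
cat >> ceph.conf <<'EOF'
osd pool default size = 1
osd crush chooseleaf type = 0
EOF

# Then prepare and activate the OSD -- note the digit on the end of the
# directory name, it is easy to miss:
#   ceph-deploy osd prepare  node1:/var/local/osd1
#   ceph-deploy osd activate node1:/var/local/osd1
# And stand up the RADOS gateway so the S3 endpoint exists:
#   ceph-deploy rgw create node1
grep -c 'osd' ceph.conf
```

Push the edited ceph.conf out with `ceph-deploy --overwrite-conf config push` before activating, or the daemons will not see the new settings.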

Once I had done that, I simply needed a new user to make a connection with the S3 command line.
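A sketch of that step; the uid and display name are placeholders of mine, and the JSON shown is an abbreviated stand-in for what the command actually prints:

```shell
# radosgw-admin prints the new user's S3 credentials as JSON:
#   radosgw-admin user create --uid=sandbox --display-name="Sandbox User"
#
# Abbreviated sample of the returned JSON, plus a quick way to pull the
# keys out for the S3 client configuration:
cat > user.json <<'EOF'
{"user_id": "sandbox",
 "keys": [{"user": "sandbox",
           "access_key": "PLACEHOLDER_ACCESS",
           "secret_key": "PLACEHOLDER_SECRET"}]}
EOF
grep -o '"access_key": "[^"]*"' user.json
```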

I then tested the python program suggested by the docs:

Then on to S3. Of course, you need to install the S3 command line tool first.

As you can see most calls to the gateway are supported:
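For reference, here is roughly how I pointed s3cmd at the gateway instead of AWS; the host name, port and keys below are placeholders (7480 is the civetweb default in recent releases), and I write the config to ./s3cfg so it does not clobber a real ~/.s3cfg:

```shell
# Minimal s3cmd config aimed at the RADOS gateway; setting host_bucket to
# the same value as host_base forces path-style bucket addressing.
cat > s3cfg <<'EOF'
[default]
access_key = PLACEHOLDER_ACCESS
secret_key = PLACEHOLDER_SECRET
host_base = node1.example.com:7480
host_bucket = node1.example.com:7480
use_https = False
EOF
# Typical calls that worked for me against the gateway:
#   s3cmd --config ./s3cfg mb s3://testbucket
#   s3cmd --config ./s3cfg put ceph.conf s3://testbucket/
#   s3cmd --config ./s3cfg ls s3://testbucket
grep host_base s3cfg
```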


Hadoop Client Setup for Azure Based Edge Node accessing ADLS

In my work I have the need to try unusual things. Today I needed to set up a Hadoop client, aka an "edge node", that was capable of accessing ADLS in Azure. While I'm sure this is documented somewhere, here is what I did.

Repo Config

In order to obtain the packages that would allow me to access ADLS I went to the Hortonworks docs. Based upon the fact that HDInsight 3.6 maps to HDP 2.6, I was fairly sure the Hortonworks repo link was the way to go. Today I used Ubuntu 16.04, so I grabbed the appropriate link for the HDP repo file. I was then able to follow the appropriate instructions in the HDP docs for Ubuntu.

Next it's important to configure the correct core-site.xml. The version that comes from just installing hadoop-client is essentially blank, and the version required to access ADLS is special. Here is a copy of my core-site.xml with the critical pieces redacted. You also need to generate a service principal and add it to your Azure account in order to complete these values.
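As a sketch, the shape of mine was roughly this. The dfs.adls.* property names are what I recall from the HDP 2.6 era hadoop-azure-datalake module (newer Hadoop renames them to fs.adl.*), and every value here is a redacted placeholder from the service principal you generated:

```shell
# Write a minimal ADLS-capable core-site.xml; replace the REDACTED values
# with your service principal's app ID, key, and tenant token endpoint.
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>REDACTED-SERVICE-PRINCIPAL-APP-ID</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>REDACTED-SERVICE-PRINCIPAL-KEY</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/REDACTED-TENANT-ID/oauth2/token</value>
  </property>
</configuration>
EOF
grep -c '<property>' core-site.xml
```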

Once that is in place you should be able to use the "adl://" URI scheme against your Azure ADLS account at the command line. The nice part about ADLS is that it is more Hadoop-like than your average Azure blob store.

You can now use this by hand and in scripts as needed. You can also access ADLS programmatically via REST API.

Automated Ambari Install

I routinely have need for a single node install of HDP for integration work. Of course I could script all of that by hand and not use a cluster manager, but then what would I have to blog about? I routinely prep a single instance in EC2 with a small script and then punch through the GUI setup by hand. In the interest of time, and as a part of a larger set of partial automation projects I am working on, I decided it would be nice to get a single node install working as hands-free as possible. Here we go:

The Setup
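A sketch of the prep script I run on a fresh Ubuntu 16.04 instance. The Ambari version in the repo URL and the apt key ID are from my notes, so lift the current ones from the Hortonworks docs before trusting them:

```shell
# Generate the provisioning script; copy it to the instance and run as root.
cat > ambari-setup.sh <<'EOF'
#!/bin/sh
set -e
# Register the Ambari repo (version in the URL is an example -- check docs)
wget -O /etc/apt/sources.list.d/ambari.list \
  http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.6.2.2/ambari.list
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
apt-get update
apt-get install -y ambari-server
# -s runs setup silently with defaults (embedded Postgres, bundled JDK)
ambari-server setup -s
ambari-server start
EOF
chmod +x ambari-setup.sh
```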

This should get you to a clean install of Ambari with no actual Hadoop packages installed.

Blueprint Registration

The balance of this code installs the actual Hadoop packages:

Now, in order to run the last section, a blueprint file is needed. I created this simply by installing via the Ambari GUI to get everything how I wanted it and then dumping the blueprint to JSON, which lets me replicate the setup.
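A sketch of that dump-and-register round trip; the cluster and blueprint names are placeholders, and the `?format=blueprint` export is the trick that captures the GUI-built cluster as JSON:

```shell
# Base endpoint for a local Ambari server (default port 8080):
AMBARI=http://localhost:8080/api/v1

# Export the blueprint from the hand-built cluster:
#   curl -s -u admin:admin "$AMBARI/clusters/handbuilt?format=blueprint" > blueprint.json
# Register it under a name on the fresh server:
#   curl -s -u admin:admin -H 'X-Requested-By: ambari' -X POST \
#        -d @blueprint.json "$AMBARI/blueprints/single-node"
# Confirm it was accepted (registration itself prints nothing):
#   curl -s -u admin:admin "$AMBARI/blueprints"
echo "$AMBARI"
```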

I did edit this file slightly by hand, such as reducing the HDFS replication factor. The file it produced was FAR more detailed than the sample blueprints I had found elsewhere.

You won't see any real output when you register the blueprint. (There is a topology validation process, though it can be disabled.) You can check, however, to make sure it was accepted. I suggest you do this if you are still testing the blueprint, as registration fails silently when there are errors.

Cluster Install

For the next step you need a hostmapping file, which maps hosts to components and is fairly simple for a single node.

I typically add a single sed to my script to swap the FQDN for the internal AWS hostname.
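Here is a sketch of both pieces together; the blueprint name, default password and hostname are placeholders, and on a real EC2 instance I take the FQDN from `hostname -f` rather than hardcoding it:

```shell
# Minimal single-node hostmapping template:
cat > hostmapping.json <<'EOF'
{
  "blueprint": "single-node",
  "default_password": "changeme",
  "host_groups": [
    { "name": "host_group_1",
      "hosts": [ { "fqdn": "FQDN_PLACEHOLDER" } ] }
  ]
}
EOF
# On the instance itself I use:  FQDN=$(hostname -f)
FQDN=ip-10-0-0-1.ec2.internal
sed -i "s/FQDN_PLACEHOLDER/$FQDN/" hostmapping.json
grep fqdn hostmapping.json
```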

Then submit using the following to kick off the cluster install:
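Roughly, the submission is a single POST; the cluster name "single" is a placeholder, and Ambari answers with a request href you can poll for status:

```shell
# Kicking off the install by POSTing the hostmapping to the clusters API:
CMD='curl -s -u admin:admin -H "X-Requested-By: ambari" -X POST -d @hostmapping.json http://localhost:8080/api/v1/clusters/single'
echo "$CMD"
```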

When you are successful, you will get a small JSON response back referencing the install request that was created.

Monitoring Progress

At this point you can for sure monitor the progress via RESTful calls like this:
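Something along these lines; request id 1 is typical for the first install on a fresh server, and the cluster name is the same placeholder as before:

```shell
# Poll the request Ambari returned from the cluster POST and watch the
# request_status / progress_percent fields in the response:
CMD='curl -s -u admin:admin http://localhost:8080/api/v1/clusters/single/requests/1'
echo "$CMD"
```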

But it's far easier at this point to just open the Ambari web interface on port 8080. You will see your services installing, and all the pretty lights turn green letting you know things are installed and ready to use.

Ambari Services Installing

Troubleshooting etc.


While I played with this I did see an issue where it seemed like things launched, but when I checked the Ambari GUI I saw a message in the operations dialog listing the problem as "PENDING HOST ASSIGNMENT", which apparently means the Ambari agent was not online and/or "registered". I only saw this once, and upon trying a clean new install test I didn't see it again.

Reset the Default Password

Don’t be the person who runs Ambari on Amazon with the default password. Just don’t do it.
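Ambari exposes password changes through its users endpoint, so this can go straight into the same script; NEW_PASSWORD is obviously a placeholder:

```shell
# Payload for changing the admin password right after install:
cat > passwd.json <<'EOF'
{"Users": {"user_name": "admin", "old_password": "admin", "password": "NEW_PASSWORD"}}
EOF
# Apply it with:
#   curl -s -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
#        -d @passwd.json http://localhost:8080/api/v1/users/admin
grep -q old_password passwd.json && echo "payload ok"
```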

All these steps together should allow you to put together a nice script for launching a single node test cluster of your own. There are instructions in the Ambari docs for adding nodes to your cluster, or for starting with a more complex hostmapping. Depending upon your needs you could even capture an AMI once you are set up, allowing you to launch a single node even faster. The nice part about this script is that you always get the newest HDP release.

Block placement and Multi-tenancy – Where is your data?

Most folks tend to think of the Hadoop storage layer as a large hard drive. At a high level I guess this is a fair assumption. The real issue comes to light when one considers actual block placement in Hadoop. Many architects want to design systems for multi-tenancy using Hadoop as a core part of their design. HDFS has a very particular strategy for block placement: within a single cluster, blocks cannot be restricted to a set of hosts (using a default build of HDFS) or even a set of drives within a host.

This means that even though users might feel that HDFS permissions and POSIX-style ACLs protect data from unauthorized access, this only applies to folks using the front door. As they say, locks are for honest people. Only users attempting to access your data via Hadoop-based methods will be blocked. The unscrupulous could still, for instance, directly access nodes and therefore the blocks placed in the Linux file system that Hadoop itself leverages. Encryption of this data at rest is a partial answer to this dilemma, but transparent encryption is a relatively new HDFS feature not yet adopted by most distribution vendors. The classical answer has been to engage third party vendors for a more robust answer to the issue of Hadoop encryption at rest. At the end of the day the blocks could still be accessed (isolating direct log in to data nodes helps) and theoretically decrypted. I'm sure this could easily be the topic, or at least a subplot, of a spy movie.

The other alternative is to simply NOT use HDFS. MapR-FS, for example, has none of these issues because it essentially is a true file system. Blocks are not the unit of replication, the Namenode metadata is not an issue as the metadata is distributed, and one cannot access MapR-FS nor any of its components in any way other than the front door (via the MapR client). Along with the ability to place data not only on specific nodes but also on specific drives within a node, MapR-FS is really a more elegant solution. For these reasons true data isolation really is guaranteed with MapR-FS. With HDFS, the correct answer to guarantee data isolation is really an entirely separate HDFS subsystem (another cluster). HDFS has the concept of namespaces, but that really addresses the small files issue and isn't an enforcement method for data placement on nodes or drives. A feature-by-feature comparison tends to leave HDFS wanting.

The lesson here is to really study what is included and how components are configured prior to engaging in the use of Hadoop for a Multi-tenant infrastructure.

Hadoop Versioning and Ecosystem Abstraction – The MapR Angle

As a follow-up to my previous article on Hadoop versioning, I noticed that I left off MapR. A similar logic and structure applies to MapR versioning, again with some nuance worth explaining. When looking for the version of MapR you are using, a simple cat of MAPR_HOME/MapRBuildVersion will show you what you are working with.

This is a simple way to grab the version (5.0.0), the build (32987) and the release state (GA) of the distribution of MapR in use.
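To make that concrete, the file holds a single dotted string you can split apart; the value here is the one from the sandbox, and on a live node you would read it from the file instead:

```shell
# On a live node:  v=$(cat /opt/mapr/MapRBuildVersion)
v="5.0.0.32987.GA"
echo "$v" | awk -F. '{printf "version=%s.%s.%s build=%s state=%s\n", $1, $2, $3, $4, $5}'
# -> version=5.0.0 build=32987 state=GA
```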

But what about Ecosystem products? 

MapR categorizes the packages in a distribution into core packages and ecosystem packages. Packages in the ecosystem include things like Hive, for example. In RPM-based Linux systems one can simply use yum to investigate what's installed or available.

Yum cleanly separates installed vs available automatically, and you can see core vs ecosystem by the repo column. Again, the exact version is displayed in a similar context.

Core products use a version string of the form X.X.X.Y.Z, with X.X.X being the major version, Y the build and Z the release state. The ecosystem packages are slightly different in that they are listed as X.X.X.Y, where X.X.X is the version number from the corresponding Apache project and Y is the date.

All that said, it's always good to know not only what version of a package you are using, but do yourself a favor and be sure to read the release notes in the documentation.

One of the best things about MapR is the abstraction of the ecosystem layer from the core distribution. What does this mean in practice? It means that you have a range of ecosystem package versions to choose from within a given version of the core MapR distro. This is often overlooked, but it is a great example of the forethought that has gone into MapR, which is sometimes seen as "different". If you have a mission critical application running on Hive 0.13, there is NO reason to install a whole new cluster just to run Hive 1.0. Ultimately this type of flexibility leaves the user in charge of their own cluster.

MapR Yarn Log Aggregation

MapR sometimes takes some criticism for being "different", which is sometimes translated as "hard to use" because it doesn't behave like Hadoop distribution brand X that I know and love. I get that. I also know that all software (like people) has warts. Sometimes those differences are intentional, and for the most part that is the case with MapR. Lots of folks misunderstand the method to the madness initially. I can say that most users, once armed with some understanding, tend to love MapR.

That aside, I wanted to point out one such difference. It has to do with YARN log aggregation. I hear repeatedly from folks saying they enabled log aggregation but for some reason are still seeing a message from the History Server telling them there is a problem.

What typically is happening in this case is that although folks remembered to turn on log aggregation by setting yarn.log-aggregation-enable to true in yarn-site.xml, they forgot about running configure.sh. Running configure.sh is important when installing or changing the configuration of your cluster. It does a handful of things, but most importantly it reads XML configuration file changes. There is a log kept under MAPR_HOME/logs/configure.log showing what has been run and when. The other problem seen is that users forget to pass "-HS" to configure.sh. This lets MapR know where the History Server is running and on what port, and without it you will likely get error messages similar to the one above. Once completed, you should be able to use the History Server web interface to drill into logs, OR you can use the yarn logs command line.

The configure script is a small MapR nuance but is critical for proper cluster configuration. Any time you add or remove software or reconfigure things, make sure you run configure.sh. You will see it across all the docs for the installation of Hadoop ecosystem products in MapR.
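The invocation I mean looks like this; the History Server hostname is a placeholder, and MAPR_HOME is assumed to be the usual /opt/mapr:

```shell
# -R reuses the existing cluster configuration, -HS names the History
# Server host; what ran and when is recorded in /opt/mapr/logs/configure.log
CMD='/opt/mapr/server/configure.sh -R -HS hs.example.com'
echo "$CMD"
```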

NOSASL - Hive 1.0 with JDBC connection timeout

So there I was, using the new MapR 5.0 sandbox with Hive 1.0. I started with a quick test of JDBC connectivity using beeline.
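The attempt looked roughly like this, reconstructed from memory with the sandbox-local host and the default HiveServer2 port:

```shell
# The URL that hung on the sandbox:
URL="jdbc:hive2://localhost:10000/default"
echo "$URL"
# Invoked as:
#   beeline -u "$URL"
```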

Followed by an indefinite hang. So off I went to check logs: hive.log, the HS2 log and so on. No real output of consequence. I took a look at the logging configuration and turned the log level to DEBUG.

Again, followed by an indefinite hang. Everything appeared to be running correctly and there was nothing in the logs... to Google with you! I found HIVE-6852, which describes an issue on the client side with the SASL setting. The workaround is described as adding ";auth=noSasl" to the connection string, which resolved the problem. Since so many folks I know use JDBC, I thought this was worth a quick write-up.

NOSASL is picked up in hive-site.xml via the hive.server2.authentication property, which has three basic options: NONE, NOSASL and KERBEROS. Setting NONE for hive.server2.authentication in hive-site.xml would also make auth=noSasl unnecessary.
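Both sides of that, as a sketch (sandbox host and default port again):

```shell
# Client-side workaround -- just a suffix on the JDBC URL:
URL="jdbc:hive2://localhost:10000/default;auth=noSasl"
echo "$URL"

# Equivalent server-side fix in hive-site.xml (then no URL suffix needed):
cat > hive-auth-snippet.xml <<'EOF'
<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>
EOF
```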

From a MapR perspective there is the concept of secure mode, which when enabled secures connections via maprsasl (which is distinct from Kerberos mode). Since the sandbox is configured without secure mode or Kerberos, setting the authentication mode is required.

Connecting to MapR via JDBC

I recently needed to do some connection testing via JDBC to HiveServer2 in MapR 4.0.2 on a secure and Kerberized cluster. Since this was new to me, I thought it would be worthwhile to write up my notes, as I will surely forget half of this in less than a few weeks. Just in case you need to connect via JDBC to a Kerberized cluster, here are some tips.

MapR has its own login system called maprlogin for use in a secure cluster setup that requires authentication via password or Kerberos to use the cluster.
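The basic flow looks like this; the only assumption is the default ticket location, which on MapR is keyed by your numeric uid:

```shell
# Get a MapR ticket before touching the cluster:
#   maprlogin password     # authenticate with your cluster password
#   maprlogin kerberos     # or use an existing Kerberos TGT instead
#   maprlogin print        # inspect the ticket you were issued
TICKET="/tmp/maprticket_$(id -u)"
echo "$TICKET"
```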

Once that is completed, users can act on the cluster normally via the hadoop commands at the command line (or through the Linux command line on an NFS-mounted MapR-FS).

For JDBC connections users must use the principal configured by the Kerberos and Hadoop Admins for the hive service (not their user principal – that is used in the previous step). See the setting “hive.server2.authentication.kerberos.principal” along with the matching keytab specified in “hive.server2.authentication.kerberos.keytab” which in MapR is typically under /opt/mapr/conf/mapr.keytab.

As a quick refresher, here is what we are dealing with in terms of a Kerberos principal in a JDBC connection string:

jdbc:hive2://HS2.FQDN:10000/default;principal=primary/instance@REALM
HS2.FQDN – the HiveServer2 fully qualified domain name and port – the default port is 10000.

Primary – the username; the user must exist on the HiveServer2 node. You will normally see a user principal that matches the service name – i.e., hive, yarn, etc.

Instance – use an FQDN that is resolvable both forward and reverse. This is a property of Apache Hive, not specific to the MapR distribution (Hive 0.13 was used in this testing against MapR 4.0.2). Hive attempts to resolve this hostname when connecting via Kerberos.

Realm – the correct Kerberos realm name – see your local /etc/krb5.conf or consult your Kerberos administrator.

And here are some examples of how to connect. Beeline is a JDBC command line client that ships with Hive, and there are many other tools that connect in a similar fashion.
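Assembling the pieces above into a concrete string; every name here is a placeholder for your environment:

```shell
# Build the Kerberized connection string from its parts:
HS2_FQDN="hs2.example.com"
REALM="EXAMPLE.COM"
URL="jdbc:hive2://${HS2_FQDN}:10000/default;principal=hive/${HS2_FQDN}@${REALM}"
echo "$URL"
# Then, with a valid ticket in place (maprlogin / kinit):
#   beeline -u "$URL"
```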

You may need to either call the beeline client with a fake user name and password or hit enter twice interactively. This is a known bug.

Java Client



HDFS ACLs – Managing Data from the ground up.

HDFS and the permissions placed upon it are the first line of authorization in Hadoop. Every other service that relies upon HDFS must adhere to the permissions enforced by HDFS. This is the place to start when you are concerned about the security of your Hadoop system; one simply needs to apply the principle of least privilege, just like with everything else in IT. Many people fret about Kerberos and about encryption of data at rest and in motion, all of which are worthy discussions, but ignore the basic facilities of the file system (permissions and quotas) as one of the easiest ways to both restrict and selectively share information.

A cool new feature in HDFS is the inclusion of extended ACLs. The full documentation is provided here, but I always like to write the guide for the impatient, aka the quick start article, so I won't visit every detail of this feature. Using classic users and groups from Linux (POSIX) you could already restrict access to files and directories in HDFS using a pretty standard permissions model, with a few exceptions like the execute bit (no executable files in HDFS) and setuid/setgid (though the sticky bit is still there).
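One prerequisite worth calling out: ACL support is off by default, so the Namenode needs this flag in hdfs-site.xml before any setfacl call will work (this is the stock Apache property name):

```shell
# hdfs-site.xml fragment enabling extended ACLs on the Namenode:
cat > hdfs-acl-snippet.xml <<'EOF'
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
EOF
grep -q 'dfs.namenode.acls.enabled' hdfs-acl-snippet.xml && echo "snippet written"
```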


Once two users and two unique groups are defined in Linux (do this on the Namenode – that is where group membership is picked up), a quick investigation of the ACLs can be launched. Many people ask "How do I enable complex access for multiple groups and users?" HDFS was recently extended with POSIX-style ACLs to support exactly these more complex use cases. So when user Bob attempts to write to user Joe's home directory, access is denied. Notice the plus sign, now indicating that extended ACLs have been set on the directory.

Although you can see the actual ACLs in the permission denied error, a slightly closer look at the extended ACLs shows the reason more clearly:

That directory is owned by user Joe, and even though Joe is in poc_group2, that group does not have write permissions. That group can list the contents of Joe's home, but unless you are in poc_group1 (full access) or poc_group2 (read access) you can't see anything in Joe's home (the "other" class is set to no access). Another cool thing to notice is the use of the mask setting. One could also set a mask that overrides the other entries, such as:
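A hedged sketch of the commands behind this walkthrough, with the mask override last; the users, groups and path are the placeholder names from the text:

```shell
# Grant group-level access and inspect the result:
#   hdfs dfs -setfacl -m group:poc_group2:r-x /user/joe   # read for group 2
#   hdfs dfs -getfacl /user/joe                           # shows the extended entries
#   hdfs dfs -setfacl -m default:group:poc_group1:rwx /user/joe  # inherited by children
# Cap everyone's effective permissions at read with a mask entry:
CMD='hdfs dfs -setfacl -m mask::r-- /user/joe'
echo "$CMD"
```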

In this case you end up with an effective setting of read-only, since the mask now overrides the permissions in the ACL. The other interesting point is that the "default" entries control what child objects inherit.

So what? What does all this have to do with anything? Well, for starters it's a great way to exert basic control over your HDFS layout. If you want to stop folks from simply filling up your file system with junk, this is a great way to do it (along with quotas). Lots of folks in the market describe adopting Hadoop as a challenge due to lack of control over the environment. Hadoop permissions and ACLs are the lowest levels of control one has over what is placed in Hadoop, and probably one of the least well understood.