Hadoop Client Setup for an Azure-Based Edge Node Accessing ADLS

In my work I have the need to try unusual things. Today I needed to set up a Hadoop client, aka an “edge node”, that was capable of accessing ADLS in Azure. While I’m sure this is documented somewhere, here is what I did.

Repo Config

In order to obtain the packages that would allow me to access ADLS, I went to the Hortonworks docs. Since HDInsight 3.6 maps to HDP 2.6, I was fairly sure the Hortonworks repo was the way to go. Today I used Ubuntu 16.04, so I grabbed the appropriate link for the HDP repo file and then followed the corresponding instructions in the HDP docs for Ubuntu.
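
For reference, here is roughly what that looked like for me. The repo URL version and signing key ID are the ones from the HDP 2.6 docs at the time, so treat them as examples and check the current docs before copying:

```bash
# Add the HDP repo for Ubuntu 16.04 (version in the URL is an example; use the one from the HDP docs)
sudo wget -O /etc/apt/sources.list.d/hdp.list \
  http://public-repo-1.hortonworks.com/HDP/ubuntu16/2.x/updates/2.6.5.0/hdp.list

# Import the Hortonworks signing key (key ID as listed in the HDP docs)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys B9733A7A07513CAD

# Only the client bits are needed on an edge node -- no daemons
sudo apt-get update
sudo apt-get install -y hadoop-client
```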

Next it’s important to configure core-site.xml correctly. The version that comes from just installing hadoop-client is essentially blank, and the version required to access ADLS is special. Here is a copy of my core-site.xml with the critical pieces redacted. You also need to generate a service principal in your Azure account in order to complete these values.
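
As a sketch, the ADLS-related pieces look something like this. The client ID, credential, and tenant ID placeholders come from the service principal, and the account name is your own Data Lake Store:

```bash
# Overwrite the mostly blank core-site.xml with the ADLS settings (placeholder values)
sudo tee /etc/hadoop/conf/core-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>adl://YOUR_ACCOUNT.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>YOUR_SERVICE_PRINCIPAL_APPLICATION_ID</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>YOUR_SERVICE_PRINCIPAL_KEY</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
  </property>
</configuration>
EOF
```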

Once that is in place you should be able to use “adl:///” URIs against your Azure Data Lake Store at the command line. The nice part about ADLS is that it is more Hadoop-like than your average Azure Blob store.
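
For example, a quick smoke test from the edge node might look like:

```bash
# Because fs.defaultFS points at the store, the short adl:/// form works
hadoop fs -ls adl:///
hadoop fs -mkdir -p adl:///tmp/edge-node-test
hadoop fs -put /etc/hosts adl:///tmp/edge-node-test/
hadoop fs -ls adl:///tmp/edge-node-test/
```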

You can now use this by hand and in scripts as needed. You can also access ADLS programmatically via its REST API.

Automated Ambari Install

I routinely have need for a single node install of HDP for integration work. Of course I could script all of that by hand and not use a cluster manager, but then what would I have to blog about? I routinely prep a single instance in EC2 with a small script and then punch through the GUI setup by hand. In the interest of time, and as part of a larger set of partial automation projects I am working on, I decided it would be nice to get a single node install working as hands-free as possible. Here we go:

The Setup

The first chunk of the script should get you to a clean install of Ambari with no actual Hadoop packages installed.
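
A rough sketch of that chunk, assuming a CentOS 7 EC2 instance (the Ambari repo URL and version are examples; grab the current ambari.repo from the Hortonworks docs):

```bash
# Register the Ambari repo and install the server plus a local agent
sudo wget -O /etc/yum.repos.d/ambari.repo \
  http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.2.2/ambari.repo
sudo yum install -y ambari-server ambari-agent

# -s runs a silent setup with all the defaults (including the embedded Postgres)
sudo ambari-server setup -s
sudo ambari-server start

# The agent on this same host registers itself with the local Ambari server
sudo ambari-agent start
```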

Blueprint Registration

The balance of the script installs the actual Hadoop packages:
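
The key call is registering the blueprint JSON with Ambari’s REST API; as a sketch, with the blueprint named “single-node-hdp” and the default credentials:

```bash
# Register the blueprint under an arbitrary name; the same name is referenced at cluster creation time
curl -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d @blueprint.json \
  http://localhost:8080/api/v1/blueprints/single-node-hdp
```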

Now, in order to run that last section, a blueprint file is needed. I created mine simply by installing via the Ambari GUI to get everything the way I wanted it and then dumping the blueprint to JSON so I could replicate it.
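
If it helps, the export is a single GET against the Ambari server of the cluster you built through the GUI (the cluster name is whatever you called it there):

```bash
# Dump the existing cluster's layout and configuration as a reusable blueprint
curl -u admin:admin -H "X-Requested-By: ambari" \
  "http://localhost:8080/api/v1/clusters/MyGuiCluster?format=blueprint" \
  -o blueprint.json
```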

I did edit this file slightly by hand, such as reducing the HDFS replication factor. The file it produced was far more detailed than any of the example blueprints I had found elsewhere.

You won’t see any real output when you register the blueprint; there is a topology validation step, though it can be disabled. You can check, however, to make sure it was accepted, and I suggest you do so while you are still testing the blueprint, since registration fails silently if there are errors.
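
A quick way to check, as a sketch against the default endpoint:

```bash
# The blueprint should appear in this list once it has been accepted
curl -u admin:admin -H "X-Requested-By: ambari" \
  http://localhost:8080/api/v1/blueprints

# If you really do want to skip topology validation, append ?validate_topology=false
# to the registration POST above
```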

Cluster Install

For the next step you need a hostmapping file, which maps hosts to components and is fairly simple for a single node.
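
A minimal single-node version might look like this; the blueprint name must match the one you registered, the host group name must match the blueprint’s host group, and the FQDN and password are placeholders:

```bash
cat > hostmapping.json <<'EOF'
{
  "blueprint": "single-node-hdp",
  "default_password": "CHANGE_ME",
  "host_groups": [
    {
      "name": "host_group_1",
      "hosts": [
        { "fqdn": "FQDN_PLACEHOLDER" }
      ]
    }
  ]
}
EOF
```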

I typically add a single sed to my script to swap the placeholder FQDN for the internal AWS hostname.
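
Something along these lines:

```bash
# Swap the placeholder for this instance's internal hostname
sed -i "s/FQDN_PLACEHOLDER/$(hostname -f)/" hostmapping.json
```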

Then submit using the following to kick off the cluster install:
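
Roughly (the name in the URL becomes the cluster name):

```bash
# Create the cluster from the registered blueprint and the host mapping
curl -u admin:admin -H "X-Requested-By: ambari" \
  -X POST -d @hostmapping.json \
  http://localhost:8080/api/v1/clusters/single-node-hdp
```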

When you are successful there will be some output like this:
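
The exact fields and status value vary a bit between Ambari versions, but it is essentially a small JSON resource pointing at the provisioning request that was just queued, along these lines:

```json
{
  "href" : "http://localhost:8080/api/v1/clusters/single-node-hdp/requests/1",
  "Requests" : {
    "id" : 1,
    "status" : "Pending"
  }
}
```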

Monitoring Progress

At this point you can for sure monitor the progress via RESTful calls like this:
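
For example, polling the request that the create call returned (the ?fields parameter trims the response down to the interesting bits):

```bash
# progress_percent and request_status tell you how far along the install is
curl -u admin:admin -H "X-Requested-By: ambari" \
  "http://localhost:8080/api/v1/clusters/single-node-hdp/requests/1?fields=Requests/progress_percent,Requests/request_status"
```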

But it’s far easier at this point to just open the Ambari web interface on port 8080. You will see your services installing and all the pretty lights turn green, letting you know things are installed and ready to use.

[Screenshot: Ambari services installing]

Troubleshooting etc.

PENDING HOST ASSIGNMENT

While I played with this I did see an issue where it seemed like things launched, but when I checked the Ambari GUI the operations dialog listed the problem as “PENDING HOST ASSIGNMENT”, which apparently means the Ambari agent was not online and/or registered. I only saw this once, and on a fresh install test I didn’t see it again.
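
If you do hit it, checking the agent and the list of registered hosts is quick; as a sketch:

```bash
# Make sure the agent is actually running and able to reach the server
sudo ambari-agent status
sudo ambari-agent restart

# The host should show up here once it has registered
curl -u admin:admin -H "X-Requested-By: ambari" http://localhost:8080/api/v1/hosts
```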

Reset the Default Password

Don’t be the person who runs Ambari on Amazon with the default password. Just don’t do it.
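
You can change it through the web UI or, as a sketch, over the same REST API the rest of the script uses (NEW_PASSWORD is a placeholder):

```bash
# Replace the default admin password
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
  -d '{"Users/password": "NEW_PASSWORD", "Users/old_password": "admin"}' \
  http://localhost:8080/api/v1/users/admin
```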

All these steps together should allow you to put together a nice script for launching a single node test cluster of your own. There are instructions in the Ambari docs for adding nodes to your cluster or starting with a more complex hostmapping setup. Depending upon your needs you could even capture an AMI once you are set up, allowing you to launch that same setup even faster. The nice part about this script is that you always get the newest HDP release.

Hue Job Designs

Since I am a command line zealot, I wanted to understand how to submit Hadoop jobs via Hue in the Hortonworks Sandbox. There are already a ton of tutorials included with the Sandbox focused on the use of Hive and other tools for some of the most popular use cases, like sentiment analysis. Moving beyond those examples, let’s say I have my own jar files. Then what? Luckily there is a Job Designer icon that allows you to submit a whole range of custom jobs and have them either run immediately or be scheduled later via Oozie. Just so I could understand the mechanics, I simply jarred up the classic WordCount example. In case you weren’t aware, the jar $HADOOP_HOME/hadoop-examples.jar contains a number of examples for you to use. Provided you have some working code, you should be able to quickly cobble together a job design in Hue that allows you to submit that code to Hadoop.
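
As a quick sanity check before wiring a jar into a Job Design, the bundled examples jar can be run straight from the command line (the HDFS paths here are placeholders):

```bash
# Run the stock WordCount driver against some input already sitting in HDFS
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /user/hue/input /user/hue/output
```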

There is quite a collection of possibilities here, including email, SSH, shell scripts and more. If you haven’t explored Job Designs in Hue, it’s worth your time. Hue was previously only available with the Hortonworks Sandbox, but as of HDP version 1.3.2 it can be installed manually on fully distributed clusters.

UPDATE: I was reminded that Hue has been available as a tarball from http://gethue.com for some time AND is not only in the Cloudera distro but also part of Apache Bigtop. This was a simple omission on my part, as I usually write articles on my blog from an HDP perspective since I use it daily. Apologies to all the folks doing great work on Hue. I know from working in this field that people love it.

First Things I do in the Hortonworks Sandbox

The Hortonworks Sandbox has been out for some time now, and over that time a few versions have been released. I use the sandbox for demos, POCs and development work. In the interest of always using the latest and greatest, I have had the opportunity to install a few new versions. A pattern has emerged of changes and things I do with the sandbox when a new version comes out, and I thought sharing this list might be useful to others:

1. Snapshot it – I run a Mac Pro so I use VMware Fusion. One of the first things I do is snapshot my installation right after I install. I also make sure I take snapshots at key breakpoints over the working life of the instance, including after installing packages and at either end of projects. I like being able to make whatever changes I want while experimenting with the sandbox and then back out of them quickly.

2. Modify the Network – The sandbox used to come preconfigured with two network adapters, which was later reduced to one. Of course it’s up to you how you connect the sandbox in general, but I have found I like my primary adapter to be on a host-only network while adding a second adapter for external connections. This allows me to not only update the sandbox with new tutorials but also connect to various online repos and pull packages down via the command line as I see fit. I like to choose a static IP for my sandbox and make an entry in my local hosts file for name resolution that stays consistent over time.

3. Install additional packages – I usually have a list of things I like to use and install. The list might vary for you, but some of my favorites include mlocate, R along with the R Hadoop libraries, Mahout, Flume, Elasticsearch and Kibana. If you have other Linux nodes you want to harness as a group, I also recommend pdsh.

4. Swap SSH Keys – I don’t like to keep typing passwords, so I always create keys for the root user and swap in my host OS public key (see the first sketch after this list). Be sure to also change the default root password from “hadoop”.

5. Enable Ambari – There is a small script called “start_ambari.sh” in root’s home directory that you can run to enable Ambari. Run this script and reboot. You will then have Ambari available at http://hostname:8080 with username admin and password admin, while the standard Hue interface is available at http://hostname.

6. Check for new Tutorials – You can click the “About Hortonworks Hue” icon in the upper left hand corner and then click “Update” on the resulting page. Check this every so often to make sure you have all the latest tutorials.

7. Enable NFS to HDFS – This is a little more involved but is possible. I will have a blog entry on the Hortonworks main site detailing the steps involved, and I will link to it here. This gives you the ability to mount HDFS as an NFS directory on your local workstation. It isn’t really made for transferring data at scale, but it is another very handy option up to a point.

8. Increase the amount of available memory – This is a no-brainer. Turn up the amount of memory available to the sandbox to make your life easier. I have 16GB on my laptop so I have plenty to spare. If you don’t, try to find out whether you can host this VM somewhere with more memory available. Lots of times administrators don’t mind giving you space to run a small VM like this. Try running the built-in Hadoop benchmarks as you increase the hardware specs and see what happens.

9. Change the Ambari admin password – The default Ambari login is username admin with the password set to admin. Make sure you change this immediately.

10. Add users including HDFS users – It’s Linux, so you can simply use “adduser” to add OS-level users. Also add HDFS users and set quotas (see the second sketch after this list). You can then simply use hadoop-create-user.sh to add your Hadoop users.

11. Connect Clients – I run a collection of clients including Talend Open Studio for Big Data, Tableau and Microsoft Excel powered by the Hortonworks ODBC driver. These are all pretty involved and probably worthy of separate blog entries.
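
For item 4, a minimal sketch of the key swap, assuming the sandbox is reachable as “sandbox” via your hosts file:

```bash
# Push your host OS public key to the sandbox, then retire the default root password
ssh-copy-id root@sandbox
ssh -t root@sandbox passwd
```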
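
And for item 10, if you prefer doing it by hand rather than via hadoop-create-user.sh, roughly (“analyst” is just an example user):

```bash
# OS-level account
adduser analyst

# HDFS home directory, ownership, and an optional space quota (run as the hdfs superuser)
sudo -u hdfs hadoop fs -mkdir /user/analyst
sudo -u hdfs hadoop fs -chown analyst:analyst /user/analyst
sudo -u hdfs hadoop dfsadmin -setSpaceQuota 10737418240 /user/analyst   # ~10 GB
```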

There are probably more things I am forgetting here but this is a good list of the basics that I touch when installing a new version of the Hortonworks Sandbox.