Yarn Scheduler Load Simulator


I worked for years in HPC, specifically in workload management on tools like PBS Professional and Platform LSF. Most of the time, when we wanted to understand what was happening with an HPC scheduling system that had more options than was humanly possible to comprehend, we tested on live systems, tuning options and making changes until the desired behavior was achieved. It was a little more art than science at times. No matter the platform, one thing I always wanted was a scheduling simulator that would let me scientifically optimize scheduler behavior.


If you are new to Hadoop you may not have heard of YARN. It is part of Hadoop 2.0 and represents the next generation of Hadoop processing, including new daemons, scheduling based upon resources (no more slots), and the possibility of executing programming paradigms beyond MapReduce (e.g., MPI, Storm, etc.). I think most people have recognized that while MapReduce is great for certain things, there is probably also room at the party for a few other guests.

There is one JIRA specifically that I have been watching that has me all worked up. Yes, it's a scheduling simulator. The proposed feature would offer such novel delights as detailed costs per scheduler operation, allowing tuning of the entire cluster or of individual queues. From my work history I know how hotly debated scheduling policy becomes at companies. It typically becomes a struggle between competing groups jockeying for resources for their projects. This one JIRA would move that conversation from a political battle to a discussion of simulation results run to inform the decision.


This will probably be one of the first things I play with in Hadoop 2.0 very soon. Below is a nice video showing the current state of the project. This has me excited, to say the least. I have many other ideas to help bring the knowledge of HPC scheduling to Hadoop where applicable, but that will have to be another blog post.

Hue Job Designs

So, since I am a command line zealot, I wanted to understand how to submit Hadoop jobs via Hue in the Hortonworks Sandbox. There are already a ton of tutorials included with the Sandbox that are focused on the use of Hive and other tools for some of the most popular use cases like sentiment analysis. Moving beyond those examples, let's say I have my own jar files? Then what? Luckily there is a Job Designer icon that allows you to submit a whole range of custom jobs and have those either run immediately or be scheduled later via Oozie. Just so I could understand the mechanics, I simply jarred up WordCount, which is walked through here. In case you weren't aware, the jar HADOOP_HOME/hadoop-examples.jar contains a number of examples for you to use. Provided you have some working code, you should be able to quickly cobble together a job design in Hue that allows you to submit that code to Hadoop.
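To make the mechanics concrete, here is a minimal sketch of jarring up WordCount and exercising the examples jar from the command line; the class name and the input/output paths are assumptions for illustration, not part of any tutorial.

# Compile and package a local WordCount (WordCount.java assumed to be in the
# current directory).
mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
jar cvf wordcount.jar -C wordcount_classes/ .

# The resulting jar can be referenced from a Hue job design, or run directly:
hadoop jar wordcount.jar WordCount /user/hue/input /user/hue/output

# The bundled examples jar mentioned above works the same way.
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /user/hue/input /user/hue/out-examples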


There is quite a collection of possibilities here, including email, ssh, shell scripts and more. If you haven't explored Job Designs in Hue, it's worth your time. Previously Hue was only available with the Hortonworks Sandbox, but as of HDP version 1.3.2 it can be installed manually on fully distributed clusters.

UPDATE: I was reminded that Hue has been available as a tarball from http://gethue.com for some time AND has not only been in the Cloudera distro but is also part of Apache BigTop. This was a simple omission on my part, as I usually write articles on my blog from an HDP perspective since I use it daily. Apologies to all those folks doing great work on Hue. I know from working in this field that people love it.

Connecting Talend to Hadoop


If you have not used Talend, it's a great tool for doing ETL visually. The drag and drop WYSIWYG interface (which is basically Eclipse) allows you to program in Java without knowing you are programming (see the "code" tab for your job). I have played a lot with Talend. Since there is a free version for Big Data, it's easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once running, you will probably want to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.

While simply dragging the components onto the designer pane of the job (the center of the screen) and editing each component's properties will work, the concept of a "context" in Talend is very useful.


You can take a look at the help file on contexts within Talend, but I don't think there is a Big Data specific version, so I thought I would make some notes on this. On the far left, in the repository view, you will see a submenu called Contexts that you can control-click (or right-click) to add a context. After naming it, you can add what amounts to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.


This includes the JobTracker, NameNode, Templeton, HBase, Hive, Thrift and user name, as well as a data directory and a jars directory. You will need to copy a number of jars to your local system from Hadoop and register them with Talend. For the Talend version I am using, Open Studio for Big Data 5.3.1r104014, I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various components' properties as you use them. Your list may be slightly different, but the point is simply to find a local place to dump the jars and then reference that location using a Talend context rather than constantly typing a full path. You will also need to make sure you import your context into your job. This amounts to clicking on the "Contexts" tab in the job properties and clicking on the clipboard icon to import the contexts.
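To show what gathering those jars might look like, here is a hedged sketch; the sandbox hostname and the source paths are assumptions that will vary by HDP version, and javacsv usually comes from a separate download rather than the cluster itself.

# Collect the jars Talend asks for into one local directory that the jars
# directory variable in the HadoopConnectivity context can point at.
mkdir -p ~/talend-jars

# Pull the Hive and HCatalog jars from the sandbox (paths assumed; wildcards
# because exact versions differ between releases).
scp "root@sandbox:/usr/lib/hive/lib/datanucleus-core-*.jar"  ~/talend-jars/
scp "root@sandbox:/usr/lib/hive/lib/datanucleus-rdbms-*.jar" ~/talend-jars/
scp "root@sandbox:/usr/lib/hive/lib/hive-exec-*.jar"         ~/talend-jars/
scp "root@sandbox:/usr/lib/hive/lib/hive-metastore-*.jar"    ~/talend-jars/
scp "root@sandbox:/usr/lib/hcatalog/share/hcatalog/hcatalog-*.jar" ~/talend-jars/

ls ~/talend-jars   # reference this one directory from the Talend context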


To use the context in the component I have captured a screenshot showing the HDFS component using a standard double quote and append syntax to access the global variable:

"hdfs://" + context.NN_HOST + ":8020/"


Contexts are a really useful trick for making the components of your job transportable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to run on a new cluster by simply altering the context.

YARN is coming, and it will change the daemons used in Hadoop and naturally the context information, but again, you can have more than one context as well as more than one cluster. Contexts are a great way to rapidly abstract cluster specific details out of your Talend jobs. I will follow up soon with a YARN based post connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop, and I will likely present a selection of them in subsequent blog posts.

 

First Things I do in the Hortonworks Sandbox

The Hortonworks Sandbox has been out for some time now, and over that time a few versions have been released. I use the sandbox for demos, POCs and development work. In the interest of always using the latest and greatest, I have had the opportunity to install a few new versions. A pattern has emerged of changes and things that I do with the sandbox when a new version comes out. I thought sharing this list might be useful to others:

1. Snapshot it – I run a Mac Pro, so I use VMware Fusion. One of the first things I do is snapshot my installation right after I install. I also make sure I take snapshots at key points over the working life of the instance, including after installing packages and when bookending projects. I like to be able to make all the changes I want while experimenting with the sandbox and then simply back out of those changes quickly.

2. Modify the network – The sandbox used to come preconfigured with two network adapters, which was later reduced to one. Of course it's up to you how you connect the sandbox in general, but I have found I like my primary adapter to be on a host-only network while adding a second adapter for external connections. This allows me to not only update the sandbox with new tutorials but also connect to various online repos and pull packages down via the command line as I see fit. I like to choose a static IP for my sandbox and make an entry in my local hosts file for name resolution that is consistent over time (several of the command line steps in this list are sketched after it).

3. Install additional packages – I usually have a list of things I like to use and install. This list might vary for you, but some of my favorites include mlocate, R along with the RHadoop libraries, Mahout, Flume, Elasticsearch and Kibana. If you have other Linux nodes you want to harness as a group, I also recommend pdsh.

4. Swap SSH keys – I don't like to keep typing passwords, so I always make sure I create keys for user root and swap them with my host OS public key. Be sure to also change the default root password from "hadoop".

5. Enable Ambari – There is a small script called "start_ambari.sh" in root's home directory that you can run to enable Ambari. Run this script and reboot. You will then have Ambari available at http://hostname:8080 with username admin and password admin, while the standard Hue interface is available at http://hostname.

6. Check for new tutorials – You can click the "About Hortonworks Hue" icon in the upper left hand corner and then click "Update" on the resulting page. Check this every so often to make sure you have all the latest tutorials.

7. Enable NFS to HDFS – This is a little more involved but is possible. I will have a blog entry on the Hortonworks main site detailing the steps involved, and I will link to it here. This gives you the ability to mount HDFS as an NFS directory on your local workstation. It isn't really made for transferring data at scale, but it is another very handy option up to a point.

8. Increase the amount of available memory – This is a no brainer. Turn up the amount of memory available to the sandbox to make your life easier. I have 16GB on my laptop, so I have plenty to spare. If you don't, then try to find out if you can host this virtual machine somewhere with more memory available. Lots of times administrators don't mind giving you space to run a small VM like this. Try running the built in Hadoop benchmarks as you increase the hardware specs and see what happens.

9. Change the Ambari admin password – The default Ambari login is username admin with password admin (as noted in item 5). Make sure you change this immediately.

10. Add users, including HDFS users – It's Linux, so you can simply use "adduser" to add OS level users. Also add HDFS users and set quotas. You can then simply use hadoop-create-user.sh to add your Hadoop users.

11. Connect clients – I run a collection of clients including Talend Open Studio for Big Data, Tableau and Microsoft Excel powered by the Hortonworks ODBC driver. All of these are pretty detailed and probably worthy of additional blog entries of their own.
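To make a few of the command line steps above concrete, here is a minimal, hedged sketch covering items 2, 3, 4, 5, 8 and 10. The hostname "sandbox", the IP address, the package list and the example user are assumptions for illustration; adjust them for your own environment.

# 2. Give the sandbox a stable name on the host machine (IP is an example).
echo "172.16.100.10   sandbox" | sudo tee -a /etc/hosts

# 4. Create and swap SSH keys for root, then change the default "hadoop" password.
ssh-keygen -t rsa
ssh-copy-id root@sandbox
ssh -t root@sandbox passwd

# 3. Install extra packages on the sandbox (package names vary by repo).
ssh root@sandbox 'yum install -y mlocate mahout flume'

# 5. Enable Ambari and reboot (script name as shipped in root's home directory).
ssh root@sandbox '~/start_ambari.sh && reboot'

# 8. A quick benchmark from the bundled examples jar; sizes are arbitrary.
hadoop jar $HADOOP_HOME/hadoop-examples.jar teragen 1000000 /tmp/teragen-out

# 10. An OS user plus an HDFS home directory and a space quota
#     (run the HDFS commands as the HDFS superuser).
adduser analyst
hadoop fs -mkdir /user/analyst
hadoop fs -chown analyst:analyst /user/analyst
hadoop dfsadmin -setSpaceQuota 1g /user/analyst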

There are probably more things I am forgetting here but this is a good list of the basics that I touch when installing a new version of the Hortonworks Sandbox.

Arun talks Yarn

In case you weren't there, here is a link to Arun Murthy, one of the founders of Hortonworks, talking about YARN at Hadoop Summit 2013. This is a topic near and dear to my heart, given that I have spent a good deal of my career working in HPC under the banners of workload management and resource negotiation across nodes.

Don’t over think Hadoop.

I was watching TV yesterday and flipped past the film Moneyball on one of the movie channels. I had seen this movie before, and I have heard all about its relationship to BI and Big Data. I guess for me the central theme of that movie was disruption as it relates to innovation.

The disruptive power of not really needing high end hardware to build a supercomputer has the market scrambling to force Hadoop into a model that fits many years of coaching by industry giants. You have to have super high end hardware with many layers of backup, redundancy and failsafes. You have to come in on the weekend and neglect your family to "save the day" for some silly website powering someone else's critical functionality. For an industry stuck in a rut of selling high end nodes into modern data warehouse powered businesses, the thought of treating slave nodes as disposable, combined with the resiliency that large numbers of slaves brings, is disruptive. Not needing fiber or even 10GbE means networks are smaller, less expensive and closer to disposable.

No need for virtualization, you say? How can this be? Virtualization is good for everything, isn't it? It's faster, more dense and cost efficient, right? Just ask your vendor; they will tell you all about it. Don't even get me started on the eternal battle of share everything versus share nothing. I have argued for share nothing for many years in classic HPC to no avail. How else can you sell massive network or cluster based storage? Query the market and you will find no end to the perversion of the original intent of Hadoop: changes to the file system, changes to the replication level, placement of data into traditional databases and placement of MapReduce over the same. Hogwash. Buy nodes. Lots of them. Cheap ones with zero features for redundancy (no, you don't need two power supplies or 18 NIC cards, and you shouldn't be paying more than $5k max). If they break beyond repair, put them in the dumpster (or maybe donate them to a good cause).

Don't over think Hadoop. Start using it and get educated. It will be disruptive and cause people to fear change (including in your own company), but at this stage, much like "cloud" a few years ago, if you don't have a strategy for Hadoop in place you are going to be sitting on your couch in October watching competing company brand X win the World Series of your field.

pdsh – this is but one of my favorite things

Do you create crazy for loops to deal with lots of hosts? I don't. I like simple. My middle school math teacher once told me that the best mathematician is a lazy mathematician. I thought, wow, I should be the best at math then. Seriously though, I help lots of folks who are new to distributed computing. Standing over their shoulders, I watch as they struggle with lots of concepts and best practices once things are spread out horizontally. Probably the very first thing someone is faced with on a daily basis is "How do I interact with all these systems?" One answer, and the topic of a previous blog entry, is to use multiple terminals or tabbed terminals, but this only goes so far. For my money, pdsh has taken me a long way. I have no intention of describing everything it can do, but suffice it to say it's worth your time getting to know this tool and the related tool pdcp. In short, these two tools will let you run commands and copy files across groups of nodes defined in dedicated group files or using a simple regular expression style pattern at the command line. A simple Linux alias (alias mypdsh='pdsh -w nodes[1-4]') is always helpful too.
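Here are a few hedged examples of the basics; hostnames and file names are made up for illustration.

# Run a command on a range of hosts using the host list pattern syntax.
pdsh -w node[1-4] uptime

# Point WCOLL at a file of hostnames (one per line) to define a default group.
export WCOLL=~/allnodes
pdsh date

# Copy a file out to the same set of hosts with pdcp
# (pdcp also needs to be installed on the remote nodes).
pdcp -w node[1-4] ~/my.cnf /tmp/

# The alias from above makes the pattern reusable.
alias mypdsh='pdsh -w node[1-4]'
mypdsh 'df -h /'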

Terminator – no not the movie.

Yes, I love AAAnold too, but this is about terminals, not terminators. Do you use lots of tabbed terminals and end up flipping back and forth? Terminator can make your life much easier. I was able to install it on Ubuntu with just a few commands, but on the Mac it took a little more doing in that I had to install Fink. Fink worked, whereas the parallel operation on MacPorts did not. In short, you can split a single terminal into multiple windows using a few simple commands. If you work on lots of systems simultaneously, this is a no brainer. This, along with pdsh, can make your life much easier at the command line.
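For reference, here is the Ubuntu one liner plus the default split keybindings, which is roughly all you need to get going (Mac installs via Fink will differ):

sudo apt-get install terminator

# Default Terminator keybindings for splitting panes (configurable in Preferences):
#   Ctrl+Shift+E   split the current terminal vertically
#   Ctrl+Shift+O   split the current terminal horizontally
#   Ctrl+Shift+W   close the current terminal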


Welcome to TECHtonka

Welcome to TECHtonka! This is my new site containing lots of technology and all the ramblings I can dream up. Yeah, I did think of Dances with Wolves when I named the site. I thought it was a good name for a site that contains large amounts of detail about various topics in technology. Buffalo are large and they run in herds, much like the technology we all use today. Welcome to TECHtonka – someday a huge site on technology, but for today it's my blog.

For some comic relief read more about Buffalo or watch this awesome video about the word.