I worked for years in HPC, specifically in workload management on tools like PBS Professional and Platform LSF. Most of the time, when we wanted to understand what was happening with an HPC scheduling system that had more options than was humanly possible to comprehend, we tested on live systems, tuning options and making changes until the desired behavior was achieved. It was a little more art than science at times. No matter which platform I used, the one thing I always wanted was a scheduling simulator that would let me scientifically optimize the scheduler's behavior.
If you are new to Hadoop you may not have heard of Yarn. It is a part of Hadoop 2.0 and will represent the next generation of Hadoop processing, including new daemons, scheduling based upon resources (no more slots), and the possibility to execute more programming paradigms than just MapReduce (e.g., MPI, Storm). I think most people have recognized that while MapReduce is great for certain things, there is probably also room at the party for a few other guests.
There is one JIRA specifically that I have been watching that has me all worked up. Yes, it's a scheduling simulator. The proposed feature would offer such novel delights as detailed costs per scheduler operation, allowing tuning of the entire cluster or of individual queues. From my work history I know how hotly debated scheduling policy becomes at companies. It typically becomes a struggle between competing groups jockeying for resources for their projects. This one JIRA would move that conversation from a political battle to a discussion of simulation results that inform the decision.
This will probably be one of the first things I play with in Hadoop 2.0. Below is a nice video showing the current state of the project. This has me excited, to say the least. I have many other ideas for bringing the knowledge of HPC scheduling to Hadoop where applicable, but that will have to be another blog post.
Since I am a command line zealot, I wanted to understand how to submit Hadoop jobs via Hue in the Hortonworks Sandbox. There are already a ton of tutorials included with the Sandbox that focus on Hive and other tools for some of the most popular use cases, like sentiment analysis. Moving beyond those examples, let's say I have my own jar files. Then what? Luckily there is a Job Designer icon that allows you to submit a whole range of custom jobs and have those either run immediately or be scheduled later via Oozie. Just so I could understand the mechanics, I simply jarred up WordCount. This is walked through here. In case you weren’t aware, the jar HADOOP_HOME/hadoop-examples.jar contains a number of examples for you to use. Provided you have some working code, you should be able to quickly cobble together a job design in Hue that allows you to submit that code to Hadoop.
There is quite a collection of possibilities here, including email, ssh, shell scripts and more. If you haven’t explored Job Designs in Hue, it's worth your time. Previously Hue was only available with the Hortonworks Sandbox, but it is now available in fully distributed clusters: as of HDP version 1.3.2, Hue can be installed manually.
UPDATE: I was reminded that Hue has been available as a tarball from http://gethue.com for some time AND is not only in the Cloudera distro but is also a part of Apache BigTop. This was a simple omission on my part, as I usually write articles on my blog from an HDP perspective since I use it daily. Apologies to all those folks doing great work on Hue. I know from working in this field that people love it.
If you have not used Talend, it's a great tool for doing ETL visually. The drag and drop WYSIWYG interface (which is basically Eclipse) allows you to program in Java without knowing you are programming (see the “code” tab for your job). I have played a lot with Talend as a tool. Since there is a free version for Big Data, it's easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once it is running, you will probably want to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.
Simply dragging components to the designer pane of the job (the center of the screen) and editing each component's properties will work, but the concept of a “context” in Talend is very useful.
You can take a look at the help file on contexts within Talend, but I don't think there is a Big Data specific version, so I thought I would make some notes on this. On the far left, in the repository view, you will see a submenu called contexts that you can control click (or right click) on to add a context. After naming it, you can add what basically amounts to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.
This includes the Job Tracker, Namenode, Templeton, HBase, Hive, Thrift, and user name, as well as a data directory and a jars directory. You will need to copy a number of jars to your local system from Hadoop and register them with Talend. For the Talend version I am using, OpenStudio for Big Data 5.3.1r104014, I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various component properties as you use them. Your list may be slightly different, but the point is simply to find a local place to dump the jars and then reference that location using a Talend context rather than constantly using a full path. You will also need to make sure you import your context into your job. This amounts to clicking on the “Contexts” tab in the job properties and clicking on the clipboard to import the contexts.
To use the context in a component, reference the variable with the standard double quote and append syntax. I have captured a screenshot showing the HDFS component accessing the global variable this way:
"hdfs://" + context.NN_HOST + ":8020/"
Contexts are a really useful trick for making the components of your job portable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to run on a new cluster by simply altering the context.
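To make the idea concrete outside of Talend, here is a minimal plain-Java sketch of what a context buys you. It is not the code Talend generates: the `Context` class is a hypothetical stand-in for Talend's context object, `JARS_DIR` is an assumed name for the jars directory variable, and the host names and directories are made up for illustration. Only `NN_HOST` and port 8020 come from the expression above.

```java
public class ContextSketch {
    /** Hypothetical stand-in for the context object a Talend job exposes. */
    static final class Context {
        final String NN_HOST;   // NameNode host, as used in the HDFS component property
        final String JARS_DIR;  // assumed variable name for the local jar directory

        Context(String nnHost, String jarsDir) {
            this.NN_HOST = nnHost;
            this.JARS_DIR = jarsDir;
        }
    }

    // Same concatenation as the component property: "hdfs://" + context.NN_HOST + ":8020/"
    static String hdfsUri(Context context) {
        return "hdfs://" + context.NN_HOST + ":8020/";
    }

    // Resolve a jar against the context's jar directory instead of hard-coding a full path.
    static String jarPath(Context context, String jarFile) {
        return context.JARS_DIR + "/" + jarFile;
    }

    public static void main(String[] args) {
        // Two illustrative contexts: values here are invented, not taken from a real cluster.
        Context sandbox = new Context("sandbox", "/home/me/talend-jars");
        Context cluster = new Context("nn01.example.com", "/opt/talend/jars");

        // The "job" logic is identical; only the context differs.
        for (Context c : new Context[] {sandbox, cluster}) {
            System.out.println(hdfsUri(c));
            System.out.println(jarPath(c, "hive-exec.jar"));
        }
    }
}
```

The two helper methods never change between environments, which is the same reason a Talend job can move from the sandbox to another cluster by swapping contexts rather than editing component properties.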
Yarn is coming, and it will change the daemons used in Hadoop and naturally the context information as well, but again you can have more than one context as well as more than one cluster. Contexts are a great way to abstract cluster specific details out of your Talend jobs. I will follow up soon with a Yarn based post connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop, and I will likely present a selection of them in subsequent blog posts.