Updating Ambari Stack Repos via REST API

If you use Ambari to deploy Hadoop you may have had occasion to need to change the repo used after you installed. At the time of this article, version 1.5.1 of Ambari requires that you do this via the API, as it is not exposed in Ambari web. The HDP repo file placed in /etc/yum.repos.d is generated by Ambari under certain conditions. In any case this is a good review of how to use the REST interface to manipulate Ambari. The basic approach is to use the curl command to GET and PUT items against the Ambari API layer. Want some basic information about your cluster? Try this:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/clusters


You can check your existing repo name via this command:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6


Then you can point this repo at whatever base URL you need via the PUT command below. This again is helpful in situations where, say, the internal repo URL has changed post-install. After this call you can double-check your work by rerunning the GET command above OR via the Ambari web interface under Admin > Clusters > Repositories.

curl -H "X-Requested-By: ambari" -X PUT -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories/HDP-2.0.6 -d '{"Repositories": {"base_url": "http://REPOHOSTNAMEHERE/HDP/centos6/2.x/updates/2.0.6.1/", "verify_base_url": false}}'

This works for all the different repos for each stack. You can play with the GET command above to explore the different options available for each installed stack.
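For example, sticking with the same placeholders (AMBARIHOSTNAME and the admin password), you can GET the parent collections to see which operating systems and repos are defined for a stack version. The exact paths may differ slightly between Ambari versions, so treat these as a starting point:

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems

curl -H "X-Requested-By: ambari" -X GET -u admin:XXXXX http://AMBARIHOSTNAME:8080/api/v1/stacks2/HDP/versions/2.0.6/operatingSystems/centos6/repositories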

Look Ma…I'm Famous, I'm in Linux Journal.

But seriously, it is kind of cool that I finally made it into Linux Journal. It has been one of my long-time reads and subscriptions, and I have learned lots of cool tech from what I consider the original Linux magazine. Anywho, take a look at my article in the latest edition of Linux Journal (the HPC edition, too!). It's about YARN and understanding how scheduling in HPC systems differs from scheduling in YARN.


Connecting Talend to Hadoop


If you have not used Talend, it's a great tool for doing ETL visually. The drag-and-drop WYSIWYG interface (which is basically Eclipse) allows you to program in Java without knowing you are programming (see the “code” tab for your job). I have played a lot with Talend as a tool. Since there is a free version for Big Data, it's easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once running, you will probably want to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.

While simply dragging components to the designer pane of the job (the center of the screen) and editing each component's properties will work, the concept of a “context” in Talend is very useful.


You can take a look at the help file on contexts within Talend, but I don't think there is a Big Data-specific version, so I thought I would make some notes on this. On the far left, in the repository view, you will see a submenu called Contexts that you can control-click (or right-click) on to add a context. After naming it, you can basically add what amounts to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.


This includes the JobTracker, NameNode, Templeton, HBase, Hive and Thrift hosts, the user name, as well as a data directory and a jars directory. You will need to copy a number of jars to your local system from Hadoop and register them with Talend. For the Talend version I am using, Open Studio for Big Data 5.3.1r104014, I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various components' properties as you use them. Your list may be slightly different, but the point is simply to find a local place to dump the jars and then reference that location using a Talend context rather than constantly typing out a full path. You will also need to make sure you import your context into your job. This amounts to clicking on the “Contexts” tab in the job properties and clicking on the clipboard to import the contexts.
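To make that concrete, here is a rough sketch of what a HadoopConnectivity context could look like when pointed at the Sandbox. Only NN_HOST appears later in this post; the other variable names, the hostname and the paths are just illustrative placeholders, so name them however makes sense for your cluster:

NN_HOST = sandbox.hortonworks.com
JT_HOST = sandbox.hortonworks.com
TEMPLETON_HOST = sandbox.hortonworks.com
HBASE_HOST = sandbox.hortonworks.com
HIVE_HOST = sandbox.hortonworks.com
THRIFT_HOST = sandbox.hortonworks.com
USER_NAME = sandbox
DATA_DIR = /user/sandbox/data
JARS_DIR = /opt/talend/hadoop-jars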


To use the context in a component, you access the global variable with standard double-quote and string-append syntax, for example in the HDFS component:

"hdfs://" + context.NN_HOST + ":8020/"


Contexts are a really useful trick for making the components of your job transportable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to run on a new cluster by simply altering the context.
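As a rough sketch of what that looks like in practice: if you build/export a job, the launcher script Talend generates can select a context group or override individual context variables on the command line. The job name, context group and hostname below are made up, and the exact option syntax is worth double-checking against your Talend version:

./MyHadoopJob_run.sh --context=Prod --context_param NN_HOST=nn.prod.example.com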

YARN is coming, and this will change the daemons used in Hadoop and naturally will change the context information, but again you can have more than one context as well as more than one cluster. Contexts are a great way to rapidly abstract cluster-specific details out of your Talend jobs. I will follow up soon with a YARN-based post connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop, and I will likely present a selection of them in subsequent blog posts.

 

Arun talks Yarn

In case you weren't there, here is a link to Arun Murthy, one of the founders of Hortonworks, talking about YARN at Hadoop Summit 2013. This is a topic near and dear to my heart, given that I have spent a good deal of my career working in HPC under the banners of workload management and resource negotiation across nodes.

Don’t overthink Hadoop.

I was watching TV yesterday and flipped past the film Moneyball on one of the movie channels. I had seen the movie before, and I have heard all about its relationship to BI and Big Data. For me, though, the central theme of that movie is disruption as it relates to innovation.

The disruptive power of not really needing high-end hardware to build a supercomputer has the market scrambling to force Hadoop into a model that fits many years of coaching by industry giants. You have to have super high-end hardware with many layers of backup, redundancy and failsafes. You have to come in on the weekend and neglect your family to “save the day” for some silly website powering someone else's critical functionality. For an industry stuck in the rut of selling high-end nodes into modern data-warehouse-powered businesses, the thought of switching to disposable slave nodes, combined with the resiliency that large numbers of slaves bring, is disruptive to the whole model.

Not needing Fiber or even 10GbE means networks are smaller, less expensive and closer to disposable. No need for virtualization, you say? How can this be? Virtualization is good for everything, isn't it? It's faster, more dense and more cost efficient, right? Just ask your vendor; they will tell you all about it. Don't even get me started on the eternal battle of share-everything versus share-nothing. I have argued for share-nothing for many years in classic HPC, to no avail. How else can you sell massive network- or cluster-based storage? Query the market and you will find no end to the perversion of the original intent of Hadoop: changes to the file system, changes to the replication level, placement of data into traditional databases and placement of MapReduce over the same. Hogwash.

Buy nodes. Lots of them. Cheap ones with zero features for redundancy (no, you don't need two power supplies or 18 NIC cards, and you shouldn't be paying more than $5k max). If they break beyond repair, put them in the dumpster (or maybe donate them to a good cause). Don't overthink Hadoop. Start using it and get educated. It will be disruptive and cause people to fear change (including in your own company), but at this stage, much like “cloud” a few years ago, if you don't have a strategy for Hadoop in place you are going to be sitting on your couch in October watching competing company brand X win the World Series of your field.