Connecting Talend to Hadoop

If you have not used Talend, it's a great tool for doing ETL visually. The drag-and-drop WYSIWYG interface (which is basically Eclipse) lets you program in Java without realizing you are programming (see the “code” tab for your job). I have played with Talend quite a bit as a tool. Since there is a free version for Big Data, it is easy to get it downloaded and running. When installing, be sure to closely follow and understand the Java requirements in the documentation and you should be good to go. Once it is running you will probably want to use components from the component palette (the menu of icons along the right side of the screen) for Big Data operations.

While simply dragging components onto the designer pane of the job (the center of the screen) and editing each component's properties will work, the concept of a “context” in Talend is very useful.

[Screen Shot 2013-09-10 at 11.19.19 AM]

You can take a look at the help file on contexts within Talend, but I don't think there is a Big Data specific version, so I thought I would make some notes on this. On the far left, in the repository view, you will see a submenu called Contexts that you can control-click (or right-click) to add a context. After naming it, you can add what amount to global variables for use in the properties of your components. I have a context called HadoopConnectivity that contains the names of the various Hadoop daemons needed by different Big Data components.
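To make that concrete: once a variable exists in a context group, any component property can reference it in Java as context.<variable name>. For example, a NameNode host variable named NN_HOST (the name used in the HDFS example later in this post) is referenced as:

context.NN_HOST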

[Screen Shot 2013-09-10 at 10.00.22 AM]

This includes the Job Tracker, NameNode, Templeton, HBase, Hive, Thrift, and user name, as well as a data directory and a jars directory.

You will need to copy a number of jars to your local system from Hadoop and register them with Talend. For the Talend version I am using, Open Studio for Big Data 5.3.1r104014, I have datanucleus-core, datanucleus-rdbms, hcatalog, hive-exec, hive-metastore and javacsv. I connected this to Hortonworks Data Platform Sandbox version 1.3.2. You will see the requirement for these jars in the various components' properties as you use them. Your list may be slightly different, but the point is simply to find a local place to dump the jars and then reference that location using a Talend context rather than constantly typing a full path.

You will also need to make sure you import your context into your job. This amounts to clicking on the “Contexts” tab in the job properties and clicking on the clipboard icon to import the contexts.

[Screen Shot 2013-09-10 at 10.15.54 AM]

[Screen Shot 2013-09-10 at 10.16.37 AM]
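Pulling this together, here is a rough, purely illustrative sketch of the kind of variables my HadoopConnectivity context holds and how they surface in the Java that Talend generates for the job. Every variable becomes a field on the job's context object; the names below (other than NN_HOST, which appears in the HDFS example that follows) and the sample comments are placeholders for a sandbox-style setup, not the exact names Talend or Hortonworks use.

// Hypothetical contents of a HadoopConnectivity context group.
// In component properties these are referenced as context.<name>.
context.NN_HOST        // NameNode host, e.g. the sandbox VM's host name
context.JT_HOST        // JobTracker host
context.TEMPLETON_HOST // Templeton (WebHCat) host
context.HBASE_HOST     // HBase host
context.HIVE_HOST      // Hive host
context.THRIFT_PORT    // Thrift port for the Hive connection
context.HADOOP_USER    // user name to run HDFS operations as
context.DATA_DIR       // HDFS directory the job reads and writes
context.JARS_DIR       // local directory holding the copied jars

// A component's jar path property can then be written as, for example:
context.JARS_DIR + "/hive-metastore.jar"  // the real file name will include a version number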

To use the context in a component, I have captured a screenshot showing the HDFS component using the standard double-quote-and-append syntax to access the global variable:

"hdfs://" + context.NN_HOST + ":8020/"

[Screen Shot 2013-09-10 at 10.35.17 AM]
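For the curious, that expression just builds an ordinary HDFS URI string. A rough, hand-written Java equivalent of what an HDFS component does with it might look like the sketch below. This is not the code Talend generates; the host name is a placeholder for whatever context.NN_HOST holds, and it assumes the Hadoop client libraries (matching your cluster, here Hadoop 1.x for HDP 1.3) are on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsContextSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for context.NN_HOST; in a real job the value comes from the context.
        String nnHost = "sandbox.hortonworks.com";

        // Same expression as in the component property above.
        String hdfsUri = "hdfs://" + nnHost + ":8020/";

        // Connect to the NameNode and list the root directory as a sanity check.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

Either way, the cluster-specific host name lives in the context, not in the component.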

Contexts are a really useful trick for making the components of your job portable across systems. The same job you create for use in a test or dev cluster (or the sandbox) can be rapidly moved to run on a new cluster simply by altering the context.

YARN is coming, and it will change the daemons used in Hadoop, which will naturally change the context information; but again, you can have more than one context as well as more than one cluster. Contexts are a great way to abstract cluster-specific details out of your Talend jobs. I will follow up soon with a YARN-based post on connecting Talend to Hadoop 2. There are tons of other useful tools that can be used with Hadoop, and I will likely present a selection of them in subsequent blog posts.