Hadoop Versioning and Ecosystem Abstraction – The MapR Angle

As a follow-up to my previous article on Hadoop versioning, I noticed that I left out MapR. MapR versioning follows a similar logic and structure, again with some nuance worth explaining. To find the version of MapR you are using, a simple cat of MAPR_HOME/MapRBuildVersion will show you what you are working with.

This is a simple way to grab the version (5.0.0), the build (32987) and the release state (GA) of the distribution of MapR in use.

But what about Ecosystem products? 

MapR categorizes the packages in a distribution into core packages and ecosystem packages. Hive, for example, is an ecosystem package. On RPM-based Linux systems you can simply use Yum to investigate what’s installed or available.

Yum cleanly separates installed vs available packages automatically, and you can tell core from ecosystem by the repo column. Again the exact version is displayed in a similar format.

Core products use

X.X.X.Y.Z

where X.X.X is the version number, Y is the build and Z is the release state. The ecosystem packages are slightly different in that they are listed as

X.X.X.Y

where X.X.X is the version number from the corresponding Apache project and Y is the date.
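To make the two layouts concrete, here is a minimal Python sketch that splits a version string into its parts. The function name and the ecosystem sample string are hypothetical, and the sketch assumes the date stamp is a run of 8 to 12 digits and that the Apache version may have two or three components. Adjust the patterns to whatever your repository actually reports.

```python
import re

# Core packages: X.X.X.Y.Z (version, build, release state), e.g. 5.0.0.32987.GA
CORE_PATTERN = re.compile(r"^(\d+\.\d+\.\d+)\.(\d+)\.([A-Za-z]+)$")

# Ecosystem packages: Apache project version followed by a date stamp.
# Assumption: the date stamp is 8 to 12 digits (YYYYMMDD, optionally with HHMM).
ECOSYSTEM_PATTERN = re.compile(r"^(\d+(?:\.\d+){1,2})\.(\d{8,12})$")


def parse_mapr_version(text):
    """Classify a version string as core or ecosystem and split it into fields."""
    text = text.strip()
    match = CORE_PATTERN.match(text)
    if match:
        return {
            "kind": "core",
            "version": match.group(1),
            "build": match.group(2),
            "state": match.group(3),
        }
    match = ECOSYSTEM_PATTERN.match(text)
    if match:
        return {
            "kind": "ecosystem",
            "version": match.group(1),
            "date": match.group(2),
        }
    raise ValueError("unrecognized version string: %r" % text)


if __name__ == "__main__":
    print(parse_mapr_version("5.0.0.32987.GA"))
    # Hypothetical ecosystem version string, for illustration only.
    print(parse_mapr_version("0.13.201503021511"))
```

Trying the core pattern first keeps the two layouts from being confused, since only core strings end in an alphabetic release state.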

All that said, it’s always good to know exactly what version of a package you are using, but do yourself a favor and be sure to read the release notes in the documentation as well.

One of the best things about MapR is the abstraction of the ecosystem layer from the core distribution. What does this mean in practice? It means that you have a range of ecosystem package versions to choose from within a version of the core MapR distro. This is often overlooked, but it is a great example of the forethought that has gone into MapR, which is sometimes seen as “different”. If you have a mission-critical application running on Hive 0.13, there is NO reason to install a whole new cluster to run Hive 1.0. Ultimately this type of flexibility leaves the user in charge of their own cluster.

MapR Yarn Log Aggregation

MapR sometimes takes some criticism for being “different” which is sometimes translated as “hard to use” because it doesn’t behave like Hadoop distribution brand X that I know and love. I get that. I also know that all software (like people) has warts. Sometimes those differences are intentional. For the most part this is the case with MapR. Lots of folks misunderstand the method to the madness initially. I can say that most users once armed with some understanding tend to love MapR.

That aside, I wanted to point out one such difference. It has to do with Yarn Log Aggregation. I repeatedly hear folks say they enabled log aggregation but for some reason are still seeing a message from the History Server telling them there is a problem.

What is typically happening in this case is that although folks remembered to turn on log aggregation by setting yarn.log-aggregation-enable to true in yarn-site.xml, they forgot about configure.sh. Running configure.sh is important when installing or changing the configuration of your cluster. It does a handful of things, but most importantly it reads xml configuration file changes. There is a log kept for configure.sh under MAPR_HOME/logs/configure.log showing what has been run and when. The other problem I see is that users forget to pass -HS to configure.sh. This lets MapR know where the History Server is running and on what port. Without it you will likely get error messages similar to the one above. Once completed, you should be able to use the History Server web interface to drill into logs OR you can use the yarn logs command line.
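Before rerunning configure.sh it can help to confirm that the property really is set the way you think it is. Below is a minimal Python sketch that reads the property out of a yarn-site.xml document. The sample XML and the function names are illustrative assumptions; point it at your own file contents. It does not replace running configure.sh, it only checks the setting.

```python
import xml.etree.ElementTree as ET

# Illustrative yarn-site.xml content; a real file will carry many more properties.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_hadoop_properties(xml_text):
    """Return the name/value pairs of a Hadoop-style *-site.xml document."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


def log_aggregation_enabled(xml_text):
    """True only if yarn.log-aggregation-enable is explicitly set to true."""
    value = read_hadoop_properties(xml_text).get("yarn.log-aggregation-enable", "false")
    return value.lower() == "true"


if __name__ == "__main__":
    print(log_aggregation_enabled(SAMPLE_YARN_SITE))
```

A missing property is treated as disabled here, which matches the stock Hadoop default of false for this setting.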

The configure script is a small MapR nuance but is critical for proper cluster configuration. Any time you add or remove software or reconfigure things, make sure you take a look at configure.sh. You will see it across all the docs for the installation of Hadoop ecosystem products in MapR.

NOSASL – Hive 1.0 with JDBC connection timeout

So there I was using the new MapR 5.0 sandbox with Hive 1.0. A quick test of JDBC connectivity using beeline

Followed by an indefinite hang. So off I go to check logs. Hive.log, HS2 log and so on. No real output of consequence. I took a look at beeline-log4j.properties and turned the log level to DEBUG.

Again followed by an indefinite hang. Everything appeared to be running correctly. Nothing in the logs… To Google with you! I found HIVE-6852, which describes an issue on the client with the SASL setting. The workaround is described as adding “;auth=noSasl” to the connection URL, which resolved the problem. Since so many folks I know use JDBC, I thought this was worth a quick writeup.

NOSASL is picked up in hive-site.xml via the hive.server2.authentication property. It has three basic options: NONE, NOSASL and KERBEROS. Setting hive.server2.authentication to NONE in hive-site.xml would also remove the need for auth=noSasl.

From a MapR perspective there is the concept of secure mode, which when enabled secures connections via maprsasl mode (which is distinct from kerberos mode). Since the sandbox is configured without secure mode or kerberos, setting the authentication mode is required.
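As a rough illustration of how the server setting drives the client URL, here is a small Python sketch that builds a HiveServer2 JDBC connection string. The function name and its defaults are my own assumptions (10000 is the usual HiveServer2 port), and it only handles the NOSASL case from this post; a Kerberos URL needs extra parameters that are not covered here.

```python
def hive_jdbc_url(host, port=10000, database="default", server_auth="NOSASL"):
    """Build a HiveServer2 JDBC URL for a given hive.server2.authentication value.

    Only the NOSASL case is special-cased: the client has to say so explicitly
    with ;auth=noSasl, otherwise it attempts a SASL handshake and can hang.
    """
    url = "jdbc:hive2://%s:%d/%s" % (host, port, database)
    if server_auth.upper() == "NOSASL":
        url += ";auth=noSasl"
    return url


if __name__ == "__main__":
    # Server configured with hive.server2.authentication=NOSASL
    print(hive_jdbc_url("localhost"))
    # Server configured with hive.server2.authentication=NONE
    print(hive_jdbc_url("localhost", server_auth="NONE"))
```

The resulting string is what you would hand to beeline with -u or to any other JDBC client.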

DZone – The Evolution of MapReduce and Hadoop

Recently I authored a section of the DZone Guide for Big Data 2014. I wrote about MapReduce and the evolution of Hadoop. There is a ton of buzz in the market about speed to insight and the push toward alternative DAG engines like Spark. I see these new techs as exciting and awesome. I welcome innovation and creativity in the Hadoop market. I also temper this excitement with a bit of reality.

The reality is that technology has a maturity curve. It comes out from the huge brain incubators like Google to the rest of us. Much like MapReduce, HDFS and HBase, these technologies have a long road until they grow into a vibrant open source community with dedicated developers. Then comes a stage where the project may become its own incubator project within Apache or another open source framework. Finally, you may see the tech evolve to the point that it’s included in a supported Hadoop distribution.

Now, in their efforts to compete, you will see vendors clamor over one another to be the first to include hot tech brand X in their offering. I see this competition as healthy, but it could lead to the inclusion of technology that is not removed from a distro immediately but just fades away. Much like Red Hat many years ago with its Linux distribution, Hadoop distributions are still relatively new. Their choices are still small. Eventually, hopefully, we will see the number of projects grow to the point where vendors are actually refusing or removing projects that are no longer relevant. I’m not sure we are there yet.

All that said, do I think MapReduce is going to dry up and blow away? No I don’t. Do I think Spark and alternative engines will gain steam as they prove their efficiencies? Yes I do. I think most folks need to hear more than one group validate scalability. I’m sure Google has tooling that works for them. Are they sharing it, or better yet, has it matured to the point that we mere mortals can use it? I think that remains to be seen.

MapReduce, and furthermore Hive, which uses MapReduce, have a HUGE ecosystem of tools that depend on Hive. Also take a look at Stinger Next to see some exciting new developments in Hive. Take a walk down the rows of vendor tables at the next Hadoop Summit or Strata and ask them in detail which Hadoop tool their tool depends on. Most likely you will eventually get to an answer that is basically Hive (or HCatalog). Hive and MapReduce will be here for years to come, if for no other reason than market penetration. Anyway, take a look at the new DZone Guide. It has many good topics that may interest you.

Hadoop Audit Logging

A recent, interesting list of customer questions included a query about audit logging in Hadoop, specifically the logging of actions in Hive such as create table actions and queries. Audit was a feature added to Hadoop some time ago. Several JIRAs addressed it, including HIVE-3505 and HIVE-1948, but it wasn’t really addressed in terms of documentation until recently via HIVE-5988.

Detailed logging is enabled via log4j settings in /etc/hadoop/conf/log4j.properties. Changing from WARN to INFO enables the logging.

In Hortonworks Data Platform this occurs around line 106 in the provided log4j file. The resulting output records the actions that happen in HDFS. Since Hive also uses HDFS, its actions are logged in /var/log/hadoop/hdfs/hdfs-audit.log

Actions such as create table, show database etc. are listed for audit logging. Parsing this log can yield detailed reports on how data was accessed in Hadoop for technologies that rely on HDFS.
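As a starting point for that kind of report, here is a minimal Python sketch that tallies commands per user. It assumes the common HDFS audit layout of tab-separated key=value fields after an "FSNamesystem.audit:" marker; the sample lines, users and paths are made up for illustration, so check them against your own hdfs-audit.log before relying on it.

```python
from collections import Counter

# Made-up sample lines in the assumed layout: timestamp, level, logger name,
# then tab-separated key=value pairs.
SAMPLE_LINES = [
    "2014-01-27 10:06:53,101 INFO FSNamesystem.audit: allowed=true\tugi=hive (auth:SIMPLE)\t"
    "ip=/10.0.0.5\tcmd=mkdirs\tsrc=/apps/hive/warehouse/sales\tdst=null\tperm=hive:hdfs:rwxr-xr-x",
    "2014-01-27 10:07:02,442 INFO FSNamesystem.audit: allowed=true\tugi=hive (auth:SIMPLE)\t"
    "ip=/10.0.0.5\tcmd=listStatus\tsrc=/apps/hive/warehouse\tdst=null\tperm=null",
    "2014-01-27 10:07:09,010 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\t"
    "ip=/10.0.0.9\tcmd=open\tsrc=/apps/hive/warehouse/sales/part-0\tdst=null\tperm=null",
]

AUDIT_MARKER = "FSNamesystem.audit: "


def parse_audit_line(line):
    """Split one audit line into a dict of fields, or return None if it is not one."""
    header, marker, payload = line.partition(AUDIT_MARKER)
    if not marker:
        return None
    fields = {"timestamp": " ".join(header.split()[:2])}
    for part in payload.strip().split("\t"):
        key, equals, value = part.partition("=")
        if equals:
            fields[key] = value
    return fields


def commands_per_user(lines):
    """Count (user, cmd) pairs across all parseable audit lines."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields and "ugi" in fields and "cmd" in fields:
            user = fields["ugi"].split()[0]  # drop the "(auth:...)" suffix
            counts[(user, fields["cmd"])] += 1
    return counts


if __name__ == "__main__":
    for (user, cmd), count in sorted(commands_per_user(SAMPLE_LINES).items()):
        print(user, cmd, count)
```

Grouping by src instead of cmd would give a per-path access report with the same few lines of code.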

Yarn Scheduler Load Simulator

I worked for years in HPC. Specifically I worked in workload management on tools like PBS Professional and Platform LSF. Most of the time, when we wanted to understand what was happening with an HPC scheduling system that had more options than was humanly possible to comprehend, we tested on live systems, tuning options and making changes until the desired behavior was achieved. It was a little more art than science at times. No matter the platform I used, one thing I always wanted was a scheduling simulator to allow me to scientifically optimize the scheduler behavior.

If you are new to Hadoop you may not have heard of Yarn. It is a part of Hadoop 2.0 and represents the next generation of Hadoop processing, including new daemons, scheduling based upon resources (no more slots), and the possibility to execute more programming paradigms than just MapReduce (e.g., MPI, Storm etc). I think most people have recognized that while MapReduce is great for certain things, there is probably also room at the party for a few other guests.

There is one JIRA specifically that I have been watching that has me all worked up. Yes, it’s a scheduling simulator. The proposed feature would offer such novel delights as detailed costs per scheduler operation, allowing tuning of the entire cluster or of the queues themselves. From my work history I know how hotly debated scheduling policy becomes at companies. It typically becomes a struggle between competing groups jockeying for resources on their projects. This one JIRA would move that conversation from a political battle to a discussion of simulation results that inform the decision.

This will probably be one of the first things I play with in Hadoop 2.0 very soon. Below is a nice video showing the current state of the project. This has me excited, to say the least. I have many other ideas to help bring the knowledge of HPC scheduling to Hadoop where applicable, but that will have to be another blog.

Welcome to TECHtonka

Welcome to TECHtonka! This is my new site containing lots of technology and all the ramblings I can dream up. Yeah, I did think of Dances with Wolves when I named the site. I thought it was a good name for a site that contained large amounts of detail about various topics in technology. Buffalo are large and they run in herds, much like the technology we all use today. Welcome to TECHtonka – someday a huge site on technology, but for today it’s my blog.

For some comic relief read more about Buffalo or watch this awesome video about the word.