Hadoop Versioning and Ecosystem Abstraction – The MapR Angle

As a follow-up to my previous article on Hadoop versioning, I noticed that I left out MapR. MapR versioning follows a similar logic and structure, again with some nuance worth explaining. To find the version of MapR you are running, a simple cat of MAPR_HOME/MapRBuildVersion will show you what you are working with.
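For example (MAPR_HOME here is a stand-in for your install directory, typically /opt/mapr):

    $ cat MAPR_HOME/MapRBuildVersion
    5.0.0.32987.GA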

This is a simple way to grab the version (5.0.0), the build (32987), and the release state (GA) of the MapR distribution in use.

But what about ecosystem products?

MapR categorizes the packages in a distribution into core packages and ecosystem packages. Ecosystem packages include projects like Hive, for example. On RPM-based Linux systems one can simply use yum to investigate what's installed or available.

Yum cleanly separates installed from available packages automatically, and you can distinguish core from ecosystem by the repo column. Again, the exact version is displayed in a similar format.
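A quick sketch (this assumes, as is the case in MapR's repositories, that every package name is prefixed with mapr-; the exact output will vary by release):

    # Everything MapR-related that is already installed
    yum list installed 'mapr-*'

    # MapR packages available in the configured repos but not yet installed
    yum list available 'mapr-*'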

Core products use

X.X.X.Y.Z

where X.X.X is the major version, Y is the build, and Z is the release state. Ecosystem packages are slightly different in that they are listed as

X.X.X.Y

where X.X.X is the version number from the corresponding Apache project and Y is the build date.
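Side by side, that looks like this (the ecosystem build date below is made up purely for illustration):

    5.0.0.32987.GA      core:      version 5.0.0, build 32987, release state GA
    1.0.201507061930    ecosystem: Apache Hive 1.0, built 2015-07-06 19:30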

All that said, it's always good to know not only what version of a package you are using, but do yourself a favor and be sure to read the release notes in the documentation as well.

One of the best things about MapR is the abstraction of the ecosystem layer from the core distribution. What does this mean in practice? It means you have a range of ecosystem package versions to choose from within a single version of the core MapR distro. This is often overlooked, but it is a great example of the forethought that has gone into MapR, which is sometimes seen as “different”. If you have a mission-critical application running on Hive 0.13, there is NO reason to install a whole new cluster just to run Hive 1.0. Ultimately this type of flexibility leaves the user in charge of their own cluster.
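A hedged sketch of what that looks like on an RPM-based system (--showduplicates is standard yum; the exact versions listed depend on which MapR ecosystem repository you have configured):

    # Show every Hive build the ecosystem repo offers for this core release
    yum --showduplicates list available mapr-hive

    # Pin the version your application was certified against
    yum install mapr-hive-0.13.*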

MapR YARN Log Aggregation

MapR sometimes takes criticism for being “different”, which is sometimes translated as “hard to use” because it doesn't behave like Hadoop distribution brand X that I know and love. I get that. I also know that all software (like people) has warts. Sometimes those differences are intentional, and for the most part that is the case with MapR. Lots of folks misunderstand the method to the madness initially, but I can say that most users, once armed with some understanding, tend to love MapR.

That aside, I wanted to point out one such difference. It has to do with YARN log aggregation. I repeatedly hear folks say they enabled log aggregation but for some reason are still seeing a message from the History Server telling them there is a problem.

What is typically happening in this case is that although folks remembered to turn on log aggregation by setting yarn.log-aggregation-enable to true in yarn-site.xml, they forgot about configure.sh. Running configure.sh is important when installing or changing the configuration of your cluster. It does a handful of things, but most importantly it reads in XML configuration file changes. A log is kept for configure.sh under MAPR_HOME/logs/configure.log showing what has been run and when.

The other problem seen is that users forget to pass -HS to configure.sh. This flag lets MapR know where the History Server is running and on what port. Without it you will likely get error messages similar to the one above. Once this is done you should be able to use the History Server web interface to drill into logs, or you can use the yarn logs command line.
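A minimal sketch of the full sequence, assuming a typical MapR layout (the yarn-site.xml path, the history-server hostname, and the application ID are placeholders):

    # 1. Confirm aggregation is actually enabled in yarn-site.xml
    grep -A1 yarn.log-aggregation-enable \
        MAPR_HOME/hadoop/hadoop-*/etc/hadoop/yarn-site.xml

    # 2. Re-run configure.sh so the change is read, and point MapR at the
    #    History Server host (-R reuses the existing cluster configuration)
    MAPR_HOME/server/configure.sh -R -HS historyserver.example.com

    # 3. Verify what ran and when
    tail MAPR_HOME/logs/configure.log

    # 4. After the next job finishes, pull its aggregated logs from the CLI
    yarn logs -applicationId application_1234567890123_0001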

The configure script is a small MapR nuance but is critical for proper cluster configuration. Any time you add or remove software or reconfigure things, make sure you take a look at configure.sh. You will see it across all the docs for installing Hadoop ecosystem products on MapR.