Block placement and Multi-tenancy – Where is your data?

Most folks tend to think of the Hadoop storage layer as one large hard drive. At a high level that is a fair assumption. The real issue comes to light when you consider actual block placement in Hadoop. Many architects want to design multi-tenant systems with Hadoop at the core, but HDFS has a very particular strategy for block placement: within a single cluster, blocks cannot be restricted to a set of hosts (using a default build of HDFS), or even to a set of drives within a host. This means that even though users might feel that HDFS permissions and POSIX-style ACLs protect their data from unauthorized access, that protection only applies to people using the front door. As they say, locks are for honest people. Only users accessing your data through Hadoop-based methods will be blocked; the unscrupulous could still, for instance, log in to a node directly and read the block files Hadoop stores in the underlying Linux file system. Encryption of this data at rest is a partial answer to this dilemma, but native HDFS encryption at rest is a relatively new feature and has not yet been adopted by most distribution vendors. The classical answer has been to engage a third-party vendor for a more robust approach to Hadoop encryption at rest. Even then, at the end of the day the block could still be accessed (restricting direct logins to the DataNodes helps) and, in theory, decrypted. I'm sure this could easily be the topic, or at least a subplot, of a spy movie.
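
To make the front-door-versus-back-door point concrete, here is a minimal sketch in Java against the standard Hadoop FileSystem API. It writes a file with owner-only permissions (the front door), then lists the raw block files a DataNode keeps on its local disk (the back door). The tenant path and the dfs.datanode.data.dir location are assumptions for illustration, not a definitive layout.

```java
// Minimal sketch: HDFS permissions guard the "front door" only.
// The HDFS path and the local DataNode data directory are assumed values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FrontDoorVsBackDoor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Front door: write a file and lock it down to the owner only.
        Path secret = new Path("/tenants/acme/secret.txt");
        try (FSDataOutputStream out = fs.create(secret)) {
            out.writeUTF("tenant data");
        }
        fs.setPermission(secret, new FsPermission((short) 0700));

        // Back door: on any DataNode, the blocks behind that file are ordinary
        // Linux files. Anyone with shell access to the node (and read rights on
        // dfs.datanode.data.dir) can open them directly, bypassing HDFS
        // permissions and ACLs entirely.
        String dataDir = "/hadoop/hdfs/data"; // assumed dfs.datanode.data.dir
        try (Stream<java.nio.file.Path> walk = Files.walk(Paths.get(dataDir))) {
            walk.filter(p -> p.getFileName().toString().startsWith("blk_"))
                .limit(5)
                .forEach(p -> System.out.println("raw block on local disk: " + p));
        }
    }
}
```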

The other alternative is simply NOT to use HDFS. MapR FS, for example, has none of these issues because it is essentially a true file system: blocks are not the unit of replication, NameNode metadata is not a concern because the metadata is distributed, and MapR FS and its components cannot be accessed in any way other than through the front door (via the MapR client). Add the ability to place data not only on specific nodes but also on specific drives within a node, and MapR FS is the more elegant solution; for these reasons, true data isolation can genuinely be guaranteed with MapR FS. With HDFS, the only way to truly guarantee data isolation is an entirely separate HDFS deployment (another cluster). HDFS does have the concept of namespaces, but that addresses NameNode scalability and the small-files issue; it is not an enforcement mechanism for data placement on specific nodes or drives. A feature-by-feature comparison tends to leave HDFS wanting.
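
For comparison, the closest stock HDFS comes to steering placement is a storage policy, which targets storage media types (DISK, SSD, ARCHIVE) rather than specific hosts or tenant-owned drives, so it is not a tenancy boundary. Below is a small sketch under that assumption; the directory name and policy choice are illustrative only.

```java
// Minimal sketch: the nearest thing stock HDFS offers to placement control.
// A storage policy steers blocks toward a storage *type*; which hosts receive
// them is still decided by the NameNode's default placement policy.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePolicyHint {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tenantDir = new Path("/tenants/acme");
        fs.mkdirs(tenantDir);

        // Policies such as "ONE_SSD", "ALL_SSD", or "COLD" influence the media
        // used for new blocks under this directory; they cannot confine a
        // tenant's blocks to a dedicated set of nodes or drives.
        fs.setStoragePolicy(tenantDir, "ONE_SSD");
        System.out.println("Storage policy set; host-level placement is still "
            + "controlled by the NameNode, not by the tenant.");
    }
}
```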

The lesson here is to study carefully what is included and how the components are configured before committing to Hadoop for a multi-tenant infrastructure.