HDFS ACLs – Managing Data from the ground up.

HDFS and the permissions placed upon it are the first line of authorization in Hadoop. Every other service that relies upon HDFS in Hadoop must adhere to the permissions enforced by HDFS. This is the place to start when concerned about the security of your Hadoop system. One simply needs to apply the principle of least privilege just like with everything else in IT. Many people fret about Kerberos for authorization and encryption of data at rest and in motion, all of which are worthy discussions but ignore the basic principles of the file system (permissions and quotas) as one of the easiest way to both restrict and selectively share information.

A cool new feature in HDFS inclusion of  ACLs.The full documentation is provided here but I always like to write the guide for the impatient aka quick start articles. I wont visit every detail of this feature. Using classic users and groups from Linux (POSIX) you could restrict access to files and directories in HDFS using a pretty standard permissions model with a few exceptions like the execute bit (no executable files in HDFS) and setuid/setgid but the sticky bit is still there.

POSIXACLS

Once defined, two users and two unique groups in Linux (do this on the Namenode – this is where the Namenode picks up group membership) a quick investigation of the ACLs could be  launched. Many people ask “How do I enable complex access multiple groups and users?” HDFS was recently extended to support POSIX to support more complex use cases. So when user Bob attempts to write to user Joe’s home directory access is denied. Notice the plus sign now indicating some extended ACLs have been set on a directory.

Although you can see the actual ACLs in the permission denied error a slightly closer look at the extended ACLs shows the reason more clearly:

That directory is owned by user Joe and even though Joe is in poc_group2 that group does not have write permissions. That group can list the contents of the Joe’s home but unless you are in poc_group1 with full access or poc_group2 with read access you can see anything in Joe’s home (group other is set to no access). Another cool thing to notice is use of the mask setting. One could also set mask that overrides the other settings such as:

In this case you end up with an effective setting of read only since the mask now overrides the permissions of the ACL. The other interesting point is that the “default” settings have to do with the behavior of the child objects.

So what? What does all this have to do with anything. Well for starters it’s a great way to exert basic control over your HDFS layout. If you want to stop folks from simply filling up your file system with junk this is great way to stop that (along with Quotas). Lots of folks in the market talk about adopting Hadoop as a challenge due to lack of control over the environment. Hadoop permission and ACLs are the lowest levels of control one has over what is placed in Hadoop and probably one of the least well understood.