Basics of Hadoop User Management

Hadoop is old, everyone has their own Hadoop cluster, and everyone knows how to use it. It’s 2018, right? This article is a small collection of gotchas, dos, and don’ts with respect to user management, covering mistakes that really shouldn’t happen anymore in 2018.

Terminology

Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip this section if you are an advanced user.

  • OS user = a user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check whether the user exists on the OS level with

id <username>

  • KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC, or one integrated into a Microsoft Active Directory.
  • Keytab = a file that stores the encryption keys (derived from the password) of a user provisioned in a KDC. It can be used with the “kinit” command line tool to authenticate without typing the password.
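For illustration, authenticating with a keytab typically looks like this on the command line (the keytab path and principal below are made up, so adjust them to your environment):

# obtain a Kerberos ticket from the KDC without typing a password
kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM

# verify that a ticket was granted and check which principal it belongs to
klist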

Do

  • Make sure your users are available on the OS level of all nodes, as well as in the KDC. This is important for several reasons:
    • When you run a job, the job might create staging/temporary directories under /tmp/, which are owned by the user running the job. The name of such a directory is the name of the OS user, while its ownership belongs to the authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC. If the OS user and the authenticated user don’t line up, you run into permission problems with these directories.
    • For security reasons, keytabs on the OS level should only be readable by the OS user who is supposed to authenticate with them (see the sketch after this list).
    • When impersonation is turned on for services, e.g., Oozie using the -doas option, Hive using the hive.server2.enable.doAs=true property, or Storm using the supervisor.run.worker.as.user=true property, a job submitted by a user authenticated as a principal will run on the OS level as a process owned by that user. If that user is not known to the OS, the job will fail to start.
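As a rough sketch of what the points above mean in practice, these are the kinds of checks you could run on a node (the user name, group, and keytab path are made up):

# the user must exist on the OS level of every node
id alice

# the keytab should only be readable by the user authenticating with it
chown alice:hadoop /etc/security/keytabs/alice.keytab
chmod 400 /etc/security/keytabs/alice.keytab

# the principal stored in the keytab should match the OS user
klist -kt /etc/security/keytabs/alice.keytab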

Don’t

  • Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default, and you shouldn’t change that configuration; see the excerpt after this list). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
  • Don’t run Hive jobs as the “hive” user. The “hive” user is an administrative user and should, if at all, only be used by the Hadoop/database administrator.
  • Or in general: don’t use the <service name> user to do <operation> on <service name>. You saw that coming, didn’t you?
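For reference, the default ban of service users on YARN comes from the Linux container executor configuration. On a secured HDP cluster it contains entries roughly like the following (the values are illustrative, and the file is managed by Ambari, so don’t edit it by hand):

# excerpt from /etc/hadoop/conf/container-executor.cfg (illustrative values)
# service users that are not allowed to run YARN containers:
banned.users=hdfs,yarn,mapred,bin
# users with a UID below this threshold cannot run containers either:
min.user.id=1000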

How to Achieve Synchronisation of KDC and OS Level

(…or other user/group management systems.) This is a tricky one: if you are not careful, you end up in a split-brain situation where one system knows one set of users and another system knows a different, possibly overlapping, set.

  • Automate user provisioning, e.g., with an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster (a sketch of the commands such a role would wrap follows after this list).
  • Use a service such as SSSD (System Security Services Daemon), which integrates users and groups from central user and group management services into the operating system. That way you don’t have to add them to each node manually, as long as SSSD is up and running.
  • Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )
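To give an idea of what such an automated role would wrap, here is a minimal sketch, assuming an MIT KDC, root access on the nodes, and made-up node names, realm, group, and keytab path:

# create the OS user on every node of the cluster (run as root)
for node in node1 node2 node3; do
  ssh "$node" "useradd -m alice"
done

# create the principal in the KDC and export a keytab for it
kadmin -p admin/admin@EXAMPLE.COM -q "addprinc -randkey alice@EXAMPLE.COM"
kadmin -p admin/admin@EXAMPLE.COM -q "ktadd -k /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM"

# distribute the keytab to the node(s) where the user needs it and lock it down
chown alice:hadoop /etc/security/keytabs/alice.keytab
chmod 400 /etc/security/keytabs/alice.keytab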


Maybe I’ll expand that list in the future 🙂

Hortonworks HDP Admin Certification Preparation

I recently obtained the Hortonworks Admin Certification. In this blog entry I have written down the most important steps to get certified, as well as my experience along the way.

Types of Certification Offered By Hortonworks

At the moment Hortonworks offers three certifications, with more to follow as their training offering grows.

  • HDP Developer (Java)
  • HDP Developer (Pig, Hive)
  • HDP Administrator

How to Prepare for the Exam

Here are a few links that helped me prepare for the exam and some additional information.

https://hortonworks.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf

This is the guide to a practice exam with conditions similar to the real one. Follow the instructions and check whether you can finish the tasks. Note that the guide describes the procedure for the developer exam, but the AWS image you should use is the one for the admin exam; when you search for the image in step 3, you can find the admin one easily. The practice exam tasks themselves can easily be found on the website.

http://hortonworks.com/training/class/hdp-certified-administrator-hdpca-exam/

This is the general information page on the HDP Certified Administrator exam and the bare minimum you need to read. At the end of the page you will find several links to documentation. Work through those tasks, understand what you are doing and why you are doing it. Also, make sure you can find these documentation pages by yourself. During the exam you will have access to the internet, which includes the HDP documentation. However, certain websites are blocked, such as the official Apache Hadoop documentation and others. The best way to go is to learn how to quickly navigate the HDP documentation, and the best way to learn that is to actively use HDP and its documentation. (http://docs.hortonworks.com/index.html)

Registration Process

Go to https://www.examslocal.com and search for your preferred exam. You can register for any available date. Dates are available Monday through Sunday around the clock and can be booked and cancelled up to 24 hours before you want to start the exam. That’s actually really cool.

My Story – All Good Things Come in Threes

It took me three (!) attempts to become an HDP Certified Administrator. Here you can read why. This might also be interesting for you if you want to know in detail how the exam procedure works.

Preparation

For the first attempt I was quite nervous. I was ready to go 15 minutes before the exam actually started. I logged into the exam’s online tool and waited while the countdown slowly ticked down. Then, finally, the exam proctor arrived; I connected remotely, started to publish my webcam stream, and shared my screen. Then the guy on the other side of the world (it was midnight at his place, he told me) asked me to perform a few steps to ensure that I was not cheating. He asked me to turn my webcam around to make sure that there were no other people close by*. Then he asked me to show him my passport. Before the exam could finally start, he asked me to open my computer’s task manager.

The Exam Starts

First of all, I made myself familiar with the environment and started to read the task sheet. It described that there would be five nodes in the HDP cluster and that I could log on to them from the client machine using the password hadoop. I tried to connect to the host called “namenode”. The host was not reachable. “Maybe it’s not up yet, and maybe it’s part of the exam to fix it”, I thought. So I tried to reach the host called “hivemaster”, which worked. Upon logging in, I quickly noticed that the “hivemaster” thought it was the “namenode” host. So I tried to reach the node that was supposed to be the namenode by its IP address, but the host could not be reached. That meant the problem did not only exist on the client machine they provided. I tried to access the firewall configuration and checked whether I could start the nodes myself (the nodes ran in Docker containers on the client machine), none of which was successful, since I did not have superuser permissions.

After only five minutes, I expressed my frustration to the tired proctor (remember: midnight). He felt sorry for me, but couldn’t help me since he “was not a technical guy” and no “technical guy” was available at this time of day either. I quit the exam, and two hours later I got a voucher for my next “attempt”.

#2

Only a few days later, the next attempt went fairly similarly. I got the exact same exam environment and, once again, couldn’t even start the exam or have a look at the exam tasks. This time I expressed my frustration to the person responsible for certification at Hortonworks (certification@hortonworks.com), who then personally made sure that the remote environment would be accessible for my third attempt.

Eventually…

One week later everything worked as it was supposed to work. I finished all of the tasks in less than the given 2 hours and received my HDP Certified Admin badge two days later.

One Final Piece of Advice

Examslocal suggests using a certain screen size/resolution. I did the exam on a notebook below the recommended values. It worked, but I had to deal with 3 scroll bars on the right-hand side of the screen and 2-3 scroll bars at the bottom of the screen. This can be very confusing and time-consuming.

  • Note: Be prepared that people might see the mess around you if you are doing the exam at home. Everything will be recorded 😉