Hadoop is old, everyone has their own Hadoop cluster and everyone knows how to use it. It’s 2018, right? This article is just a collection of a few gotchas, dos and don’ts with respect to User Management that shouldn’t happen in 2018 anymore.
Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip that section if you are an advanced user.
OS user = user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check on each node whether the user exists on OS level.
KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC or an integrated one behind a Microsoft Active Directory.
Keytab = file that stores the encrypted keys derived from the password of a user provisioned in a KDC. It can be used with the “kinit” command line tool to authenticate without typing the password.
Make sure your users are available on the OS of all nodes, as well as in the KDC. This is important for several reasons:
When you run a job, the job might create staging/temporary directories in the /tmp/ directory, which are owned by the user running the job. The name of the directory is the name of the OS user, while the ownership belongs to the authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC.
For security purposes, keytabs on OS level should be readable only by the OS user who is supposed to authenticate with them.
When impersonation is turned on for services, e.g., Oozie using the -doas option, Hive using the hive.server2.enable.doAs=true property or Storm using the supervisor.run.worker.as.user=true property, a job submitted by a user authenticated as a principal will run on the OS level as a process owned by that user. If that user is not known to the OS, the job will fail (to start).
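The two checks above (is the user known to the OS, and is the keytab private to its owner) can be sketched in a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes a Unix-like node where Python's pwd module can resolve users through the system's name service.

```python
import os
import pwd
import stat


def os_user_exists(username):
    """Return True if the operating system can resolve the user name."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        # getpwnam raises KeyError when the user is unknown to the OS.
        return False


def keytab_is_private(path, owner_uid):
    """Return True if the file is owned by owner_uid and grants
    no permissions at all to group or others."""
    info = os.stat(path)
    owned_by_user = info.st_uid == owner_uid
    closed_to_others = (info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return owned_by_user and closed_to_others
```

On a real node you would run this for every expected user on every host, passing pwd.getpwnam(username).pw_uid as owner_uid, and treat any False as a provisioning problem to fix before jobs start failing.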
Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default, and you shouldn’t change that configuration). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
Don’t run Hive jobs as the “hive” user. The “hive” user is the administrative user and should, if at all, only be used by the Hadoop/database administrator.
Or in general: Don’t use the <service name> user to do <operation> on <service name>. You saw that coming, hm?
How to Achieve Synchronisation of KDC and OS Level
(…or other user/group management systems). This is a tricky one, if you don’t want to run into a split brain situation, where one system knows one set of users and another one knows others, which may or may not overlap.
Automate user provisioning, e.g., by using an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster.
Use services such as SSSD (System Security Services Daemon) that integrate users and groups from user and group management services into the operating system. That way you won’t need to actually add them to each node, as long as SSSD is up and running.
Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )
While security is a quite complex topic by itself, security of distributed systems can be overwhelming. Thus, I wrote down a state of the art article about Hadoop (Ecosystem) Security Concepts and also published it on Hortonworks Community Connection.
In the documentation of the particular security related open source projects you can find a number of details on how these components work on their own and which services they rely on. Since the projects are open source you can of course check out the source code for more information. Therefore, this article aims to summarise, rather than explain each process in detail.
In this article I am first going through some basic component descriptions to get an idea which services are in use. Then I explain the “security flow” from a user perspective (authentication –> impersonation (optional) –> authorization –> audit) and provide a short example using Knox.
When reading the article keep the following figure in mind. It depicts all the processes that I’ll explain.
Knox serves as a gateway and proxy for Hadoop services and their UIs so that they can be accessed from behind a firewall without having to open too many ports in the firewall.
For the newest HDP release (2.6.0) use these Knox Docs
Authentication Server (AS) = responsible for issuing Ticket Granting Tickets (TGT).
Ticket Granting Server (TGS) = responsible for issuing service tickets.
Key Distribution Center (KDC) = AS + TGS. Talks with clients using the KRB5 protocol.
Contains user and group information and talks with its clients using the LDAP protocol.
Wire Encryption Concepts
To complete the picture I just want to mention that it is very important not only to secure access to services, but also to encrypt data transferred between services.
Keystores and Truststores
To enable a secure connection (SSL) between a server and a client, first an encryption key needs to be created. The server uses it to encrypt any communication. The key is securely stored in a keystore; for Java services, JKS could be used. In order for a client to trust the server, one could export the key's certificate from the keystore and import it into a truststore, which is basically a keystore containing the certificates of trusted services. In order to enable two-way SSL, the same thing needs to be done on the client side: after creating a key in a keystore the client can access, put its certificate into a truststore of the server. Commands to perform these actions are:
Generate a key in "/path/to/keystore.jks", setting its alias to "myKeyAlias" and its password to "myKeyPassword". If the keystore file "/path/to/keystore.jks" does not exist, this command will also create it.
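As a rough sketch of how the keystore and truststore steps might be scripted, the following Python helpers only build the argument lists for the JDK's keytool. The helper names, the store password, the certificate file and the distinguished name are assumptions for illustration; adapt them to your environment before running anything.

```python
import shlex


def genkey_command(keystore, alias, key_password, store_password, dname):
    """keytool call that generates a key pair (and the keystore if missing)."""
    return ["keytool", "-genkeypair", "-alias", alias, "-keyalg", "RSA",
            "-dname", dname, "-keystore", keystore,
            "-keypass", key_password, "-storepass", store_password]


def export_cert_command(keystore, alias, store_password, cert_file):
    """keytool call that exports the certificate belonging to the key."""
    return ["keytool", "-exportcert", "-alias", alias, "-keystore", keystore,
            "-storepass", store_password, "-file", cert_file]


def import_cert_command(truststore, alias, store_password, cert_file):
    """keytool call that imports a certificate into a truststore."""
    return ["keytool", "-importcert", "-noprompt", "-alias", alias,
            "-file", cert_file, "-keystore", truststore,
            "-storepass", store_password]


# Example values; "myStorePassword" and the CN are hypothetical.
command = genkey_command("/path/to/keystore.jks", "myKeyAlias",
                         "myKeyPassword", "myStorePassword",
                         "CN=server.example.com")
print(" ".join(shlex.quote(part) for part in command))
```

The lists can be handed to subprocess.run(command, check=True) on a machine with a JDK installed. Keep in mind that passwords passed as arguments are visible in the process list, so for anything beyond a test setup you may prefer to let keytool prompt for them.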
Only a properly authenticated user (which can also be a service using another service) can communicate successfully with a kerberized Hadoop service. Without the required authentication, in this case proof of the identity of both the user and the service, any communication will fail. In a kerberized environment user authentication is provided via a ticket granting ticket (TGT).
Note: Not using KERBEROS, but SIMPLE authentication, which is set up by default, provides any user with the possibility to act as any other type of user, including the superuser. Therefore strong authentication using Kerberos is highly encouraged.
Technical Authentication Flow:
User requests TGT from AS. This is done automatically upon login or using the kinit command.
User receives TGT from AS.
User sends request to a kerberized service.
User gets service ticket from Ticket Granting Server. This is done automatically in the background when user sends a request to the service.
User sends a request to the service using the service ticket.
Authentication Flow from a User Perspective:
Most of the above processes are hidden from the user. The only thing the user needs to do before issuing a request to the service is to log in on a machine and thereby receive a TGT, or receive it programmatically, or obtain it manually using the kinit command.
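Obtaining the TGT programmatically could look roughly like the sketch below. It assumes the MIT Kerberos client tools (kinit, klist) are installed; the helper names, keytab path and principal are hypothetical examples.

```python
import subprocess


def kinit_command(keytab_path, principal):
    """Argument list for obtaining a TGT from a keytab instead of a password."""
    return ["kinit", "-k", "-t", keytab_path, principal]


def has_valid_ticket():
    """Return True if 'klist -s' reports a valid ticket cache.

    'klist -s' prints nothing and signals the result via its exit status.
    If klist is not installed, this simply returns False.
    """
    try:
        return subprocess.run(["klist", "-s"]).returncode == 0
    except OSError:
        return False


def ensure_ticket(keytab_path, principal):
    """Obtain a TGT only if there is no valid one yet."""
    if not has_valid_ticket():
        subprocess.run(kinit_command(keytab_path, principal), check=True)
```

A scheduled job would call something like ensure_ticket("/etc/security/keytabs/eric.keytab", "eric@EXAMPLE.COM") before talking to a kerberized service; everything after that (service tickets) still happens in the background.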
This is the second step after a user is successfully authenticated at a service. The user must be authenticated, but can then choose to perform the request to the service as another user. If everyone could do this by default, this would raise another security concern and the authentication process would be futile. Therefore this behaviour is forbidden by default for everyone and must be granted for individual users. It is used by proxy services like Apache Ambari, Apache Zeppelin or Apache Knox. Ambari, Zeppelin and Knox authenticate as “ambari”, “zeppelin” and “knox” users, respectively, at the service using their TGTs, but can choose to act on behalf of the person who is logged in to Ambari, Zeppelin or Knox in the browser. This is why it is very important to secure these services.
To allow, for example, Ambari to perform operations as another user, set the following configs in the core-site.xml: hadoop.proxyuser.ambari.groups, a list of groups whose members are allowed to be impersonated, and hadoop.proxyuser.ambari.hosts, a list of hosts from which Ambari is allowed to impersonate. Either one can also be set to the wildcard *.
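To make the shape of these settings concrete, here is a small sketch that renders the two proxyuser properties as a core-site.xml fragment. The helper names and the example group and host are assumptions; in an Ambari-managed cluster you would normally set these values through Ambari rather than generate XML by hand.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(service_user, groups=None, hosts=None):
    """Build the two proxyuser settings; None means the wildcard '*'."""
    prefix = "hadoop.proxyuser." + service_user
    return {
        prefix + ".groups": ",".join(groups) if groups else "*",
        prefix + ".hosts": ",".join(hosts) if hosts else "*",
    }


def to_configuration_xml(properties):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example: only members of "hadoop-users" may be impersonated,
# and only from the host the Ambari server runs on.
print(to_configuration_xml(
    proxyuser_properties("ambari", ["hadoop-users"], ["ambari.example.com"])))
```

Restricting both lists instead of using the wildcard keeps the blast radius small if the proxy service itself is ever compromised.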
Authorization defines the permissions of individual users. After it is clear which user will be performing the request, i.e., the actually authenticated or the impersonated one, the service checks against the local Apache Ranger policies whether the request is allowed for this particular user. This is the last step in the process. A user passing this step is finally allowed to perform the requested action.
Every time the authorization step is invoked, i.e., policies are checked to determine whether a user’s action is authorized, an audit event is logged, containing time, user, service, action, data set and success of the event. An event is not logged in Ranger if a user without authentication tries to access data or if a user tries to impersonate another user without having appropriate permissions to do so.
Example Security Flow Using Apache Knox
Looking at the figure above you can follow what’s going on in the background, when a user Eric wants to push a file into the HDFS service on path “/user/eric/” from outside the Hadoop cluster firewall.
User Eric sends the HDFS request including the file and the command to put that file into the desired directory, while authenticating successfully via the LDAP provider at the Apache Knox gateway using his username/password combination. Eric does not need to obtain a Kerberos ticket. In fact, since he is outside the cluster, he probably does not have access to the KDC through the firewall to obtain one anyway.
The Knox Ranger plugin checks whether Eric is allowed to use Knox. If he’s not, the process ends here. This event is logged in Ranger audits.
Knox has a valid TGT (and refreshes it before it becomes invalid), obtains a service ticket with it and authenticates at the HDFS namenode as user “knox”.
Knox asks the service to perform the action as Eric, which is configured to be allowed.
The Ranger HDFS plugin checks whether Eric has the permission to “WRITE” to “/user/eric”. If he does not, the process ends here. This event is logged in Ranger audits.
File is pushed to HDFS.
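The six steps can be condensed into a toy model. This is not how Knox or Ranger are implemented; it is a plain-Python stand-in with made-up names that only mirrors the order of the checks and which of them produce audit events.

```python
from dataclasses import dataclass, field


@dataclass
class ToySecuredCluster:
    """Toy stand-in for LDAP, Ranger policies and the proxyuser settings."""
    ldap_passwords: dict        # username -> password
    knox_allowed_users: set     # users the Knox policy lets through
    knox_may_impersonate: set   # users "knox" may act on behalf of
    hdfs_write_paths: dict      # username -> set of writable paths
    audit_log: list = field(default_factory=list)

    def _audit(self, service, user, action, allowed):
        self.audit_log.append((service, user, action, allowed))

    def put_file(self, user, password, path):
        # Step 1: authenticate at the gateway via LDAP (no audit on failure).
        if user not in self.ldap_passwords or self.ldap_passwords[user] != password:
            return "denied: authentication failed"
        # Step 2: the Knox policy check is audited either way.
        allowed = user in self.knox_allowed_users
        self._audit("knox", user, "use gateway", allowed)
        if not allowed:
            return "denied: not allowed to use Knox"
        # Steps 3 and 4: "knox" authenticates as itself, then asks to act
        # as the user (a refused impersonation is not audited).
        if user not in self.knox_may_impersonate:
            return "denied: impersonation not allowed"
        # Step 5: the HDFS policy check is audited either way.
        allowed = path in self.hdfs_write_paths.get(user, set())
        self._audit("hdfs", user, "WRITE " + path, allowed)
        if not allowed:
            return "denied: no WRITE permission"
        # Step 6: the file is written.
        return "ok: file written to " + path
```

With a cluster configured for Eric as in the example, put_file("eric", ..., "/user/eric") passes all checks and leaves two audit entries, one from each policy check, while a wrong password leaves none.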
I hope this article helps to get a better understanding of the security concepts within the Hadoop Ecosystem.
This is supposed to be a brief aid to memory on how to write marker files, when using “Luigi“, which I explained in a former blog post.
What is a Marker File?
A marker file is an empty file created with the sole purpose of signaling to another process or application that some process is currently ongoing or finished. In the context of scheduling using Luigi, a marker file signals to the Luigi scheduler that a certain task of a pipeline has already finished and does not need to (re-)run anymore.
How the Common Luigi Job Rerun Logic Works
Every Luigi task has a run method. In this run method you can use any sort of (Python) code you desire. You can access the input and output streams of the Task object and use them to write data to the output stream. The principle is that a Luigi Task will not run again if the file with the filename defined in the output target already exists. This can be either a LocalTarget (local file) or an HDFSTarget (file saved to HDFS) or any other custom target. That’s basically it.
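The rule can be imitated without Luigi in a few lines. The class and function below are illustrative stand-ins, not Luigi's own code: a task counts as complete when its output file exists, and only incomplete tasks are run.

```python
import os


class ToyTask:
    """Minimal imitation of a task with a single local file as output."""

    def __init__(self, output_path):
        self.output_path = output_path

    def complete(self):
        # The task counts as done as soon as its output file exists.
        return os.path.exists(self.output_path)

    def run(self):
        with open(self.output_path, "w") as f:
            f.write("result\n")


def run_if_needed(task):
    """Run the task only when its output is missing."""
    if task.complete():
        return "skipped"
    task.run()
    return "ran"
```

Calling run_if_needed twice on the same task returns "ran" the first time and "skipped" the second time, which is exactly why a marker file is enough to stop a task from running again.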
How to Write a Marker File in a PigJobTask
Using a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your pig script. In the Luigi pipeline, you basically define the pig script location that you want to run and optionally a few other parameters depending on your Hadoop cluster configuration, but you don’t need to implement the run method anymore.
The scenario is that you do not have access to the HDFS output directory, e.g., because it’s the Hive warehouse directory or the Solr index directory, or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file locally or in HDFS that signals to Luigi that the job has already run successfully. You can specify an arbitrary output file in the output method. This will not create a marker file yet. The trick is to implement the run method, explicitly execute the Pig script in it, and do arbitrary stuff, such as creating a marker file, afterwards in the same method.
You can see a sample PigJobTask that utilizes this technique below
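import luigi
from luigi.contrib.pig import PigJobTask


class HiveLoader(PigJobTask):  # class name is a placeholder
    """
    Pig script executor to load files from HDFS into a Hive table
    (can be Avro, ORC, ...)
    """
    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

    def requires(self):
        return DependentTask()  # requirement

    def output(self):
        # Here the output file that determines if a task was run is written.
        # Can be LocalTarget or HDFSTarget or ...
        return luigi.LocalTarget(self.staging_dir + "checkpoint")

    def pig_options(self):
        # These are the pig options you want to start the pig client with
        return []  # assumption: no extra options

    def pig_script_path(self):
        # Execute pig script.
        return self.pig_script

    def pig_parameters(self):
        # Set Pig input parameter strings here (names assumed; match your script).
        return {'input_directory': self.input_directory, 'hive_table': self.hive_table}

    def run(self):
        # This is the important part. You basically tell the run method to run the Pig
        # script. Afterwards you do what you want to do. Basically you want to write an
        # empty output file - or in this case you write "SUCCESS" to the file.
        super().run()
        with self.output().open('w') as f:
            f.write("SUCCESS")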
A friend said, “Vienna needs a Hadoop User Group” and I agreed with him. The next step was to initialize a Meetup group. Meetup is a platform where everyone can organize any kind of meeting on any kind of topic. Hadoop has only recently started to gain a little traction in Austria and Vienna and I think it’s the perfect time to start a group like this.
This group is for everyone located in Vienna who uses Apache Hadoop, at any level of skill. The focus of the group is clearly technical with an eye on use cases. I try to organize technical talks by Hadoop related vendors for the sessions. Also, I want to create the opportunity to work together on real world problems and get hands-on with Hadoop. In this group we will create a network of Hadoop Users, discuss recent and interesting (technical) topics, eat, drink and – most importantly – have fun together.
I’d like the group to be interactive and that everyone has the opportunity to contribute.
For the first Meetup on Wednesday, May 18, I plan to briefly introduce the goals of the group. I believe all members of the group should brainstorm together on what all of us expect of the group in the future and try to figure out how often we should meet and which topics we want to work on.
My ideas on what it could look like in the future:
One of us could provide some code and walk the others through it. That way the more experienced among us can provide feedback and give hints on what to improve and the less experienced gain knowledge.
We can define a project to work on together: e.g., building a Hadoop cluster together out of Raspberry Pis, writing streaming applications in Apache Storm or Apache Spark together, or whatever you want,…
I plan to combine the Meetup every now and then with the Vienna Kaggle Meetup and do a session about “Data Science and Hadoop”.
Similarly to the Vienna Kaggle group, I created a git organisation for code that we work on together. If you are interested in joining, just contact me and I will give you access.
I am looking forward to getting to know you as well as hearing your ideas on what to contribute to the group.
The Hadoop Summit is a tech conference hosted by Hortonworks, one of the biggest Apache Hadoop distributors, and Yahoo, the company in which Hadoop was born. Software developers, consultants, business owners and administrators who share an interest in Hadoop and the technologies of its ecosystem all gathered in Dublin – this year’s Hadoop Summit of Europe took place in Ireland. The Hadoop Summit 2016 Dublin had some great keynotes, plenty of time to network and a lot of exciting talks about bleeding edge technology, its use cases and success stories. Also it was a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.
The organisation of the conference was great. 1300 people participated, but it never felt crowded, nor were there any (big) waiting lines to enter the speaker rooms or at the lunch buffet.
My Favorite Talks
This is a list of my favorite talks in a chronological order with their videos embedded. To be honest, this list is basically almost all of the talks that I saw in person and probably I missed even more great talks, that were given in parallel. Fortunately, we can see all of them on the official Hadoop Summit 2016 Dublin Youtube channel.
SQL streaming: This talk gave a really nice overview of the development of an SQL streaming solution with all its technical challenges and how they were addressed. Also simple technical use cases were discussed and compared to traditional SQL, where each query terminates, whereas streaming SQL queries never terminate.
Hadoop at LinkedIn: Here we got valuable insights into the Hadoop landscape of LinkedIn, as well as job monitoring and automated health checks. A job monitoring tool, Dr. Elephant, developed by LinkedIn was open sourced only a few days before the start of the Summit.
Containerization at Spotify: This talk was about how Spotify uses Docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed that Spotify overcomes security challenges by not implementing internal security measures at all. According to the speaker, everyone can access everyone’s data. If life could always be as simple as that 🙂
Apache Zeppelin + Apache Livy: Apache Zeppelin already is a great tool for interactive data analysis, exploration or even doing ETL tasks using Apache Pig, querying data using Apache Hive, as well as executing Python, R or bash scripts. Apache Livy helps data scientists work together in one notebook on a secure cluster. What I like a lot about this talk is that the speakers nicely explain the authentication mechanism involved.
Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase and much more. This talk was basically a view on the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!
10 Years of Hadoop Party
On the evening of day one, the Guinness Storehouse was used for a huge burger-beer-and-big-data networking event. As you can imagine there was good food, Guinness, great music by Irish bands on several floors and of course, most importantly, the same cool people attending the conference.
My first Hadoop Summit attendance was a great experience in all its particulars. I got great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully, I will be able to attend the next Hadoop Summit 2017 in Munich.
This is the guide to a preparation exam that has similar conditions to the real one. Follow the instructions here and check if you can finish the tasks. The AWS image you should use is the one for the admin exam. This guide describes the procedure for the developer exam. However, when you search for the image, you can find the admin one easily: Choose the one that suits in step 3. The test exam tasks can easily be found on the website.
This is general information on the HDP certified admin exam and the very basics you need to read. At the end of this page you find several links to documentation. Work through these tasks, understand what you are doing and why you are doing it. Also, make sure you can find these documentation pages by yourself. During the exam you will have access to the internet. That includes the HDP documentation. However, certain websites are blocked, such as the official documentation of Apache Hadoop and others. The best way to go is to learn how to quickly navigate through the HDP documentation. And the best way to learn that is to actively use HDP and its documentation. (http://docs.hortonworks.com/index.html)
Enter https://www.examslocal.com and search for your preferred exam. Register on any available day. Dates are available Monday – Sunday around the clock and can be booked and cancelled up to 24 hours before you want to start the exam. That’s actually really cool.
My Story – All Good Things Come in Threes
It took me three (!) attempts to become an HDP Certified Administrator. Here, you see why. This might also be interesting for you if you want to know in detail how the exam procedure works.
On the first attempt I was quite nervous. I was ready to go 15 minutes before the exam actually started. I logged into the exam’s online tool and waited while the countdown slowly went down. Then finally, the exam proctor arrived, I connected remotely, started to publish my webcam stream and shared my screen. Then, the guy on the other side of the world – it was midnight at his place, he told me – asked me to perform a few steps to ensure that I was not cheating. He asked me to turn my webcam to make sure that there were no other people close by*. Then he asked me to show him my passport. Before the exam finally could start, he asked me to open my computer’s task manager.
The Exam Starts
First of all I made myself familiar with the environment and started to read the task sheet. It said that there would be five nodes in the HDP cluster and that I could log on to them from the client machine using the password hadoop. I tried to connect to the host called “namenode”. The host was not reachable. “Maybe it’s not yet up and maybe it’s part of the exam to fix it”, I thought. So, I tried to reach the host called “hivemaster”, which worked. Upon logging in, I quickly noticed that the “hivemaster” thought it was the “namenode” host. So I tried to reach the node that was supposed to be the namenode by its IP address, but the host could not be reached. That meant the problem did not only exist on the client machine that they provided. I tried to access the firewall configuration and checked if I could start the nodes myself (the nodes ran in Docker containers on the client machine), none of which was successful, since I did not have superuser permissions.
After only five minutes, I expressed my frustration to the tired proctor (remember: midnight). He felt sorry for me, but couldn’t help me since he “was not a technical guy” and no “technical guy” was available either at this time of the day. I quit the exam and two hours later I got a voucher for my next “attempt”.
Only a few days later, the next attempt was fairly similar. I got the exact same exam environment. Once again, I couldn’t start the exam and didn’t even get to look at the exam tasks. This time I expressed my frustration to the person responsible for Hortonworks certifications (email@example.com), who then personally made sure that the remote environment would be accessible for my third attempt.
One week later everything worked as it was supposed to work. I finished all of the tasks in less than the given 2 hours and received my HDP Certified Admin badge two days later.
One Final Piece of Advice
Examslocal suggests that you use a certain screen size/resolution. I did the exam on a notebook with less than the recommended values. It worked, but I had to deal with 3 scroll bars on the right hand side of the screen and 2-3 scroll bars on the bottom of the screen. This can be very confusing and time consuming.
Note: Be prepared that people might see the mess around you if you are doing the exam at home. Everything will be recorded 😉