Basics of Hadoop User Management

Hadoop is old, everyone has their own Hadoop cluster and everyone knows how to use it. It’s 2018, right? This article is a collection of a few gotchas, dos and don’ts with respect to user management: mistakes that shouldn’t happen in 2018 anymore.

Terminology

Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip this section if you are an advanced user.

  • OS user = user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check whether the user exists on the OS level by running

id <username>

  • KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC or an integrated one behind a Microsoft Active Directory.
  • Keytab = file that stores the Kerberos keys of a user provisioned in a KDC. It can be used with the “kinit” command line tool to authenticate without typing the password.

Do

  • Make sure your users are available on the OS level of all nodes, as well as in the KDC. This is important for several reasons:
    • When you run a job, it might create staging/temporary directories in the /tmp/ directory, which are owned by the user running the job. The name of the directory is the name of the OS user, while the ownership belongs to the authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC.
    • For security purposes, keytabs on the OS level should only be readable by the OS user who is supposed to authenticate with them.
    • When impersonation is turned on for services, e.g., Oozie using the -doas option, Hive using the hive.server2.enable.doAs=true property, or Storm using the supervisor.run.worker.as.user=true property, a user authenticated as a principal will run on the OS level as a process owned by that user. If that user is not known to the OS, the job will fail (to start).

Don’t

  • Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default and don’t change that configuration). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
  • Don’t run Hive jobs as the “hive” user. The “hive” user is the administrative user and, if at all, should only be used by the Hadoop/database administrator.
  • Or in general: Don’t use the <service name> user to do <operation> on <service name>. You saw that coming, hm?

How to Achieve Synchronisation of KDC and OS Level

(…or other user/group management systems). This is a tricky one if you don’t want to run into a split-brain situation, where one system knows one set of users and another one knows a different set, which may or may not overlap. A small verification sketch follows below.

  • Automate user provisioning, e.g., by using an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster.
  • Use services such as SSSD (System Security Services Daemon) that integrate users and groups from user and group management services into the operating system. That way you won’t need to actually add them to each node, as long as SSSD is up and running.
  • Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )

 

Maybe I’ll expand that list in the future 🙂
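To catch drift between the OS level and the KDC early, a quick check goes a long way. Below is a minimal sketch (not from the original article) that verifies that a user exists on the local OS and can obtain a Kerberos ticket from its keytab; the username, keytab path and principal are hypothetical, and the check would need to run on every node (or be wrapped into the Ansible role mentioned above).

import subprocess

def user_on_os(username):
    # 'id <username>' returns a non-zero exit code if the OS does not know the user
    return subprocess.call(['id', username]) == 0

def user_in_kdc(keytab, principal):
    # 'kinit -kt' authenticates non-interactively; a failure means the principal
    # is missing in the KDC or the keytab is stale
    return subprocess.call(['kinit', '-kt', keytab, principal]) == 0

if __name__ == '__main__':
    # Hypothetical user, keytab path and realm
    username = 'etl-user'
    keytab = '/etc/security/keytabs/etl-user.keytab'
    principal = 'etl-user@EXAMPLE.COM'

    print('OS user present: {0}'.format(user_on_os(username)))
    print('KDC principal usable: {0}'.format(user_in_kdc(keytab, principal)))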

How to Troubleshoot Apache Knox

Apache Knox is a gateway application and the door to access data in a Hadoop cluster hidden behind a firewall. While the usage is fairly simple, the setup, configuration and debugging process can be tedious due to the many different components that Apache Knox ties together. On Hortonworks Community Connection I wrote an article that shows you exactly what could be wrong with your Knox setup and how to efficiently track down the error.


Common Security Architecture Using Apache Knox for Data Access

 

Hadoop Security Concepts

While security is quite a complex topic by itself, Hadoop security can be overwhelming. Thus, I wrote a state-of-the-art overview of Hadoop Security Concepts on Hortonworks Community Connection.


Simplified Depiction of the Hadoop Security Architecture

How to Write a Marker File in a Luigi “PigJobTask”

This is meant to be a brief memory aid on how to write marker files when using “Luigi”, which I explained in a former blog post.

What is a Marker File?

A marker file is an empty file created with the sole purpose of signaling to another process or application that some process is currently ongoing or has finished. In the context of scheduling with Luigi, a marker file signals to the Luigi scheduler that a certain task of a pipeline has already finished and does not need to (re-)run anymore.

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any sort of (Python) code you desire. You can access the input and output streams of the Task object and use them to write data to the output stream. The principle is that a Luigi Task will not run again if the file with the filename defined in the output target already exists. This can be a LocalTarget (local file), an HdfsTarget (file saved to HDFS) or any other custom target. That’s basically it.
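To make this concrete, here is a minimal, self-contained sketch (not part of the original post) of a task whose only output is a marker file: the first invocation creates the marker, every later invocation is skipped because the target already exists.

import luigi

class DoSomethingOnce(luigi.Task):
    '''
    Illustrative task: the marker file below is its only output.
    As long as the file exists, Luigi considers the task complete and skips it.
    '''

    def output(self):
        # Marker file that signals completion (the path is just an example)
        return luigi.LocalTarget('/tmp/do_something_once.marker')

    def run(self):
        # ... the real work would happen here ...
        # Writing the (empty) marker file marks the task as done.
        with self.output().open('w') as marker:
            marker.write('')

if __name__ == '__main__':
    luigi.run(main_task_cls=DoSomethingOnce)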

How to Write a Marker File in a PigJobTask

Using a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your Pig script. In the Luigi pipeline you basically define the location of the Pig script you want to run and, optionally, a few other parameters depending on your Hadoop cluster configuration, but you don’t need to implement the run method anymore.

The scenario is that you do not have access to the HDFS output directory, e.g., because it’s the Hive warehouse directory or the Solr index directory, or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file, locally or in HDFS, that signals to Luigi that the job has already run successfully. You can specify an arbitrary output file in the output method, but that alone will not create a marker file yet. The trick is to implement the run method anyway: explicitly execute the Pig script there and afterwards do arbitrary things in the same method, such as creating the marker file.

You can see a sample PigJobTask that utilizes this technique below.

import luigi
import luigi.contrib.pig


class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table (can be Avro, ORC, ...)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

    def requires(self):
        return DependentTask()  # requirement

    def output(self):
        '''
        Here the output file that determines whether the task has already run is defined.
        Can be a LocalTarget, an HdfsTarget or ...
        '''
        return luigi.LocalTarget(self.staging_dir + "checkpoint")

    def pig_options(self):
        '''
        These are the options you want to start the Pig client with.
        '''
        return ['-useHCatalog']

    def pig_script_path(self):
        '''
        Path of the Pig script to execute.
        '''
        return self.pig_script

    def pig_parameters(self):
        '''
        Set the Pig input parameters here.
        '''
        return {'INPUT': self.input_directory,
                'HIVE_TABLE': self.hive_table}

    def run(self):
        '''
        This is the important part. You explicitly tell the run method to execute the
        Pig script. Afterwards you do what you want to do; basically you want to write
        an empty output file, or, as in this case, write "SUCCESS" to the file.
        '''
        luigi.contrib.pig.PigJobTask.run(self)
        with self.output().open('w') as f:
            f.write("SUCCESS")

How to Create a Data Pipeline Using Luigi

This is a simple walk-through of an example usage of Luigi. Online there is the excellent documentation by Spotify themselves, where you can find all the bits and bytes to create your own pipeline script. There are also already a few blog posts about what is possible with Luigi, but, I believe, how to actually implement it is not described very well. So, in my opinion, there is either too much information to just try it out or too little information to actually get started hands-on. I’ll also mention a word about security.

Therefore, I publish a full working example of a minimalist pipeline from which you can start and copy and paste everything you need.

These are the questions I try to answer:

  • What is Luigi and when do I want to use it?
  • How do I setup the Luigi scheduler?
  • How do I specify a Luigi pipeline?
  • How do I schedule a Luigi pipeline?
  • Can I use Luigi with a secure Hadoop cluster?
  • What do I like about Luigi?

What is Luigi?

Luigi is a framework written in Python that makes it easy to define and execute complex pipelines in a consistent way. You can use Luigi …

  • … when your data is processed in (micro) batches rather than streamed.
  • … when you want to run jobs that depend on (many) other jobs.
  • … when you want to have nice visualizations of your pipelines to keep a good overview.
  • … when you want to integrate data into the Hadoop ecosystem.
  • … when you want to do any of the above and love Python.

Create Infrastructure

Every pipeline can be tested using the --local-scheduler flag on the command line, but for production you should use a central scheduler running on one node.

The first thing you want to do is to create a user and a group that the scheduler runs as.

groupadd luigi
useradd -g luigi luigi

The second step is to create a Luigi config directory.

sudo mkdir /etc/luigi
sudo chown luigi:luigi /etc/luigi

You also need to install Luigi (and Python and pip) if you did not do that already.

pip install luigi

It’s now time to deploy the configuration file. Put the following file into /etc/luigi/luigi.cfg. In this example the Apache Pig home directory of a Hortonworks Hadoop cluster is specified. There are many more configuration options listed in the official documentation.

[core]
default-scheduler-host=www.example.com
default-scheduler-port=8088

[pig]
home=/usr/hdp/current/pig-client

Don’t forget to create directories for the process id of the Luigi scheduler daemon, its logs and its state file.

sudo mkdir /var/run/luigi
sudo mkdir /var/log/luigi
sudo mkdir /var/lib/luigi
sudo chown luigi:luigi /var/run/luigi
sudo chown luigi:luigi /var/log/luigi
sudo chown luigi:luigi /var/lib/luigi

You are now prepared to start up the scheduler daemon.

sudo su - luigi
luigid --background --port 8088 --address www.example.com --pidfile /var/run/luigi/luigi.pid --logdir /var/log/luigi --state-path /var/lib/luigi/luigi.state

A Simple Pipeline

We are now ready to go. Let’s specify an example pipeline that can actually be run without a Hadoop ecosystem present: it reads data from a given file, counts the number of words and writes the result to a file called count.txt. In this example two of the most basic task types are used: luigi.ExternalTask, which requires you to implement the output method, and luigi.Task, which requires you to implement the requires, output and run methods. I added pydocs to all methods and class definitions, so the code below should speak for itself. You can also view it on Github.

import luigi

class FileInput(luigi.ExternalTask):
    '''
    Define the input file for our job:
        The output method of this class defines
        the input file of the class in which FileInput is
        referenced in "requires"
    '''

    # Parameter definition: input file path
    input_path = luigi.Parameter()

    def output(self):
        '''
        As stated: the output method defines a path.
        If the FileInput class is referenced in a
        "requires" method of another task class, the
        file can be used with the "input" method in that
        class.
        '''
        return luigi.LocalTarget(self.input_path)

class CountIt(luigi.Task):
    '''
    Counts the words from the input file and saves the
    output into another file.
    '''

    input_path = luigi.Parameter()

    def requires(self):
        '''
        Requires the output of the previously defined class.
        Can be used as input in this class.
        '''
        return FileInput(self.input_path)

    def output(self):
        '''
        count.txt is the output file of the job. In a more
        close-to-reality job you would specify a parameter for
        this instead of hardcoding it.
        '''
        return luigi.LocalTarget('count.txt')

    def run(self):
        '''
        This method opens the input file stream, counts the
        words, opens the output file stream and writes the number.
        '''
        word_count = 0
        with self.input().open('r') as ifp:
            for line in ifp:
                word_count += len(line.split(' '))
        with self.output().open('w') as ofp:
            ofp.write(unicode(word_count))

if __name__ == "__main__":
    luigi.run(main_task_cls=CountIt)

Schedule the Pipeline

To test and schedule your pipeline create a file test.txt with arbitrary content.
We can now execute the pipeline manually by typing

python pipe.py --input-path test.txt

Use the following if you didn’t set up and configure the central scheduler as described above:

python pipe.py --input-path test.txt --local-scheduler

If you did everything right, you will see that no tasks failed and that a file count.txt was created containing the word count of your input file.

Try running this job again. You will notice that Luigi tells you that the dependency is already present: Luigi detects that count.txt has already been written and will not run the job again.

Now you can easily trigger this pipeline on a daily basis by using, e.g., crontab to schedule the job, say, once every hour. If your input and output files have the current date as a suffix in their filenames, the job will be triggered every hour, but will successfully run exactly once a day. A sketch of such a date-suffixed output follows the crontab example below.

In a crontab you could do the following:

1 * * * * python pipe.py --input-path test.txt
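For completeness, here is a minimal sketch (not part of the original pipe.py) of how the output target could carry the current date, so that frequent cron triggers result in exactly one successful run per day; the class name and file names are just examples.

import datetime

import luigi

from pipe import FileInput  # assuming the example pipeline above lives in pipe.py


class DailyCountIt(luigi.Task):
    '''
    Date-parameterized variant of CountIt: the date in the output filename
    makes the task run successfully at most once per day.
    '''

    date = luigi.DateParameter(default=datetime.date.today())
    input_path = luigi.Parameter()

    def requires(self):
        return FileInput(self.input_path)

    def output(self):
        # Once count-<date>.txt exists, re-triggering the pipeline does nothing.
        return luigi.LocalTarget('count-{0}.txt'.format(self.date.isoformat()))

    def run(self):
        word_count = 0
        with self.input().open('r') as ifp:
            for line in ifp:
                word_count += len(line.split(' '))
        with self.output().open('w') as ofp:
            ofp.write(str(word_count))

if __name__ == '__main__':
    luigi.run(main_task_cls=DailyCountIt)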

Security

The cool thing about Luigi is that you basically don’t need to worry much about security. Luigi simply uses the security features of the components it interacts with. If you are, e.g., working on a secure Hadoop cluster (that means a cluster where Kerberos authentication is enforced), the only thing you need to worry about is obtaining a fresh Kerberos ticket before you trigger the job, given that the validity of the ticket is longer than the time the job needs to finish. I.e., when you schedule your pipeline with cron, make sure you do a kinit from a keytab. You can check out my answer to a related question on the Hortonworks Community Connection for more details on that (https://community.hortonworks.com/questions/5488/what-are-the-required-steps-we-need-to-follow-in-s.html#answer-5490).
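For the cron case, a minimal sketch of such a wrapper could look like the following, assuming the example pipeline above lives in pipe.py and using a hypothetical keytab path and principal.

import subprocess

import luigi

from pipe import CountIt  # assuming the example pipeline above lives in pipe.py

# Hypothetical keytab path and principal; adjust them to your environment.
KEYTAB = '/etc/security/keytabs/etl-user.keytab'
PRINCIPAL = 'etl-user@EXAMPLE.COM'

if __name__ == '__main__':
    # Obtain a fresh Kerberos ticket non-interactively, so that cron-triggered
    # runs can authenticate against the secure cluster.
    subprocess.check_call(['kinit', '-kt', KEYTAB, PRINCIPAL])
    luigi.run(main_task_cls=CountIt)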

What do I like about Luigi?

It combines my favourite programming language and my favourite distributed ecosystem. I didn’t go too much into that now, but Luigi is especially great because of its rich ways to interact with Hadoop ecosystem services. Instead of a LocalTarget you would rather use HdfsTargets or Amazon S3Targets. You can define and run Pig jobs and there even is an Apache Hive client built in.
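As a small taste of that, here is a sketch (assuming the luigi.contrib.hdfs package is installed and configured for your cluster) of the CountIt task from above writing its result to HDFS instead of the local filesystem; the HDFS path is just an example.

import luigi
from luigi.contrib.hdfs import HdfsTarget

from pipe import CountIt  # assuming the example pipeline above lives in pipe.py


class CountItToHdfs(CountIt):
    '''
    Variant of CountIt that stores the word count in HDFS instead of locally.
    The requires and run methods are inherited unchanged.
    '''

    def output(self):
        # The only real change: the target, and thus the rerun logic, lives in HDFS.
        return HdfsTarget('/tmp/luigi-demo/count.txt')

if __name__ == '__main__':
    luigi.run(main_task_cls=CountItToHdfs)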

Meet the Hadoop User Group Vienna

A friend said, “Vienna needs a Hadoop User Group” and I agreed with him. The next step was to create a Meetup group. Meetup is a platform where everyone can organize any kind of meeting on any kind of topic. Hadoop has recently started to gain a little traction in Austria and Vienna and I think it’s the perfect time to start a group like this.

This group is for everyone located in Vienna who uses Apache Hadoop, at any level of skill. The focus of the group is clearly technical, with an eye on use cases. I try to organize technical talks by Hadoop-related vendors for the sessions. I also want to establish the opportunity to work together on real-world problems and get hands-on with Hadoop. In this group we will create a network of Hadoop users, discuss recent and interesting (technical) topics, eat, drink and, most importantly, have fun together.

I’d like the group to be interactive and that everyone has the opportunity to contribute.

For the first Meetup on Wednesday, May 18, I plan to briefly introduce the goals of the group. I believe all members of the group should brainstorm together about what all of us expect of the group in the future and try to figure out how often we should meet and which contents we want to work on.

My ideas on how it could look in the future:

  • One of us could provide some code and walk the others through it. That way the more experienced of us can provide feedback and give hints on what to improve, and the less experienced gain knowledge.
  • We can define a project to work on together: e.g., building a Hadoop cluster together out of Raspberry Pis, writing streaming applications in Apache Storm or Apache Spark together, or whatever you want,…
  • I plan to combine the Meetup every now and then with the Vienna Kaggle Meetup and do a session about “Data Science and Hadoop”.
  • Similarly to the Vienna Kaggle group, I created a git organisation for code that we work on together. If you are interested in joining, just contact me and I will give you access.

I am looking forward to getting to know you as well as hearing your ideas on what to contribute to the group.

 

My Impressions of the Hadoop Summit Dublin 2016

The Hadoop Summit is a tech conference hosted by Hortonworks, one of the biggest Apache Hadoop distributors, and Yahoo, the company in which Hadoop was born. Software developers, consultants, business owners and administrators who share an interest in Hadoop and the technologies of its ecosystem all gathered in Dublin, where this year’s European Hadoop Summit took place. The Hadoop Summit 2016 Dublin had some great keynotes, plenty of time to network and a lot of exciting talks about bleeding-edge technology, its use cases and success stories. It was also a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.


Keynote: “Data is Beautiful”

The organisation of the conference was great. 1300 people participated, but it never felt crowded, nor were there any (big) waiting lines to enter the speaker rooms or at the lunch buffet.

My Favorite Talks

This is a list of my favorite talks in chronological order with their videos embedded. To be honest, this list covers almost all of the talks that I saw in person, and I probably missed even more great talks that were given in parallel. Fortunately, we can watch all of them on the official Hadoop Summit 2016 Dublin YouTube channel.

  • SQL streaming: This talk gave a really nice overview of the development of an SQL streaming solution with all its technical challenges and how they were addressed. Also simple technical use cases were discussed and compared to traditional SQL, where each query terminates, whereas streaming SQL queries never terminate.

  • Hadoop at LinkedIn: Here we got valuable insights into the Hadoop landscape of LinkedIn, as well as job monitoring and automated health checks. A job monitoring tool developed by LinkedIn, Dr. Elephant, was open sourced only a few days before the start of the Summit.

  • Containerization at Spotify: This talk was about how Spotify uses Docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed that Spotify overcomes security challenges by not implementing internal security measures at all. According to the speaker everyone can access everyone’s data. If life could always be as simple as that 🙂

  • Apache Zeppelin + Apache Livy: Apache Zeppelin already is a great tool for interactive data analysis, exploration or even ETL tasks using Apache Pig, querying data using Apache Hive, as well as executing Python, R or bash scripts. Apache Livy helps data scientists work together in one notebook on a secure cluster. What I like a lot about this talk is that the speakers nicely explain the authentication mechanism involved.

  • Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase and much more. This talk was basically a view on the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!

10 Years of Hadoop Party

On the evening of day one, the Guinness Storehouse was turned into a huge burger-beer-and-big-data networking venue. As you can imagine there was good food, Guinness, great music by Irish bands on several floors and, of course most importantly, the same cool people who attended the conference.


Author in the Guinness storehouse

Summary

My first Hadoop Summit attendance was a great experience in all its particulars. I made great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully I will be able to attend the next Hadoop Summit 2017 in Munich.