Hadoop Security Concepts

Security is a complex topic in itself, and Hadoop security can be overwhelming. I therefore wrote an article on the current state of Hadoop security concepts on Hortonworks Community Connection.


Simplified Depiction of the Hadoop Security Architecture

How to Write a Marker File in a Luigi “PigJobTask”

This is meant as a brief memory aid on how to write marker files when using “Luigi”, which I explained in an earlier blog post.

What is a Marker File?

A marker file is an empty file created with the sole purpose of signaling to another process or application that some process is currently running or has finished. In the context of scheduling with Luigi, a marker file signals to the Luigi scheduler that a certain task of a pipeline has already finished and does not need to run again.

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any (Python) code you like. You can access the inputs and outputs of the Task object and use them to write data to the output. The principle is that a Luigi Task will not run again if the file with the filename defined in the output target already exists. This can be either a LocalTarget (local file), an HdfsTarget (file saved to HDFS), or any other custom target. That’s basically it.
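
To make the rerun principle concrete, here is a minimal sketch in plain Python. It is not Luigi's actual implementation: SketchTask, schedule and demo are hypothetical names made up for illustration, and the completeness check is reduced to "does the output file exist?".

```python
import tempfile
from pathlib import Path


class SketchTask:
    """Hypothetical stand-in for a Luigi task; NOT Luigi's real implementation."""

    def __init__(self, output_path):
        self.output_path = Path(output_path)
        self.run_count = 0

    def complete(self):
        # The task counts as done as soon as its output file exists.
        return self.output_path.exists()

    def run(self):
        # Do the actual work and write the output file.
        self.run_count += 1
        self.output_path.write_text("result\n")


def schedule(task):
    """Run the task only if its output is missing; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


def demo():
    # Schedule the same task twice against a temporary output file.
    with tempfile.TemporaryDirectory() as tmp:
        task = SketchTask(Path(tmp) / "output.txt")
        first = schedule(task)
        second = schedule(task)
        return first, second, task.run_count
```

Calling demo() runs the task on the first scheduling attempt and skips it on the second, because the output file is already there.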

How to Write a Marker File in a PigJobTask

With a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your Pig script. In the Luigi pipeline, you define the location of the Pig script you want to run and, optionally, a few other parameters depending on your Hadoop cluster configuration. You don’t need to implement the run method yourself.

The scenario is that you do not have access to the HDFS output directory, e.g. because it’s the Hive warehouse directory or the Solr index directory, or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file, locally or in HDFS, that signals to Luigi that the job has already run successfully. You can specify an arbitrary output file in the output method. This alone will not create a marker file. The trick is to implement the run method, explicitly execute the Pig script in it, and do whatever else you need afterwards in the same method, such as creating a marker file.

You can see a sample PigJobTask that uses this technique below:

import luigi
import luigi.contrib.pig


class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table (can be Avro, ORC, ...)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

    def requires(self):
        # DependentTask is a placeholder for the upstream task this job depends on
        return DependentTask()

    def output(self):
        '''
        Defines the output file that determines whether the task has already run.
        Can be LocalTarget or HdfsTarget or ...
        '''
        return luigi.LocalTarget(self.staging_dir + "checkpoint")

    def pig_options(self):
        '''
        These are the Pig options you want to start the Pig client with.
        '''
        return ['-useHCatalog']

    def pig_script_path(self):
        '''
        Path of the Pig script to execute.
        '''
        return self.pig_script

    def pig_parameters(self):
        '''
        Set Pig input parameter strings here.
        '''
        return {
            'INPUT': self.input_directory,
            'HIVE_TABLE': self.hive_table,
        }

    def run(self):
        '''
        This is the important part. You first tell the run method to run the Pig
        script. Afterwards you do whatever you want to do. Typically you write an
        empty output file; in this case you write "SUCCESS" to the file.
        '''
        # Run the Pig script via the parent implementation first ...
        luigi.contrib.pig.PigJobTask.run(self)
        # ... then write the marker file that output() points to.
        with self.output().open('w') as f:
            f.write("SUCCESS")
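
The order of the two steps in run matters: the marker must only be written after the job has finished successfully. The following stand-alone sketch illustrates that ordering with a plain subprocess instead of Pig. The names run_then_mark and demo are hypothetical, and the sketch assumes that a failing job raises an exception before the marker line is reached, which is the behaviour I would expect from a job runner but have not spelled out for Pig here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_then_mark(command, marker_path):
    """Run command; create the empty marker file only if it succeeded."""
    # check=True raises CalledProcessError on a non-zero exit code,
    # so the touch() below is never reached for a failed job.
    subprocess.run(command, check=True)
    Path(marker_path).touch()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ok_marker = Path(tmp) / "ok.marker"
        failed_marker = Path(tmp) / "failed.marker"

        # A job that exits with code 0: the marker gets written.
        run_then_mark([sys.executable, "-c", "pass"], ok_marker)

        # A job that exits with code 1: no marker is written.
        try:
            run_then_mark(
                [sys.executable, "-c", "raise SystemExit(1)"], failed_marker
            )
        except subprocess.CalledProcessError:
            pass

        return ok_marker.exists(), failed_marker.exists()
```

If the marker were written before the job ran, a failed job would still look finished to the scheduler and would never be retried.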

Meet the Hadoop User Group Vienna

A friend said, “Vienna needs a Hadoop User Group” and I agreed with him. The next step was to set up a Meetup group. Meetup is a platform where anyone can organize meetings on any kind of topic. Hadoop has only recently started to gain some traction in Austria and Vienna, and I think it’s the perfect time to start a group like this.

This group is for everyone in Vienna who uses Apache Hadoop, at any skill level. The focus of the group is clearly technical, with an eye on use cases. I try to organize technical talks by Hadoop-related vendors for the sessions. I also want to create the opportunity to work together on real-world problems and get hands-on with Hadoop. In this group we will build a network of Hadoop users, discuss recent and interesting (technical) topics, eat, drink and, most importantly, have fun together.

I’d like the group to be interactive and for everyone to have the opportunity to contribute.

For the first Meetup on Wednesday, May 18, I plan to briefly introduce the goals of the group. I believe all members should brainstorm together about what we expect of the group in the future, and try to figure out how often we should meet and which topics we want to work on.

My ideas on what it could look like in the future:

  • One of us could provide some code and walk the others through it. That way the more experienced among us can give feedback and hints on what to improve, and the less experienced gain knowledge.
  • We can define a project to work on together, e.g. building a Hadoop cluster out of Raspberry Pis, writing streaming applications in Apache Storm or Apache Spark, or whatever you want.
  • I plan to combine the Meetup every now and then with the Vienna Kaggle Meetup and do a session about “Data Science and Hadoop”.
  • As with the Vienna Kaggle group, I created a git organisation for code that we work on together. If you are interested in joining, just contact me and I will give you access.

I am looking forward to getting to know you as well as hearing your ideas on what to contribute to the group.

 

My Impressions of the Hadoop Summit Dublin 2016

The Hadoop Summit is a tech conference hosted by Hortonworks, one of the biggest Apache Hadoop distributors, and Yahoo, the company in which Hadoop was born. This year’s European Hadoop Summit took place in Dublin, Ireland, where software developers, consultants, business owners and administrators with a shared interest in Hadoop and the technologies of its ecosystem gathered. The Hadoop Summit 2016 Dublin had some great keynotes, plenty of time to network and a lot of exciting talks about bleeding-edge technology, its use cases and success stories. It was also a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.


Keynote: “Data is Beautiful”

The organisation of the conference was great. 1300 people participated, but it never felt crowded, nor were there any (big) waiting lines to enter the speaker rooms or at the lunch buffet.

My Favorite Talks

This is a list of my favorite talks in chronological order with their videos embedded. To be honest, this list contains almost all of the talks that I saw in person, and I probably missed even more great talks that were given in parallel. Fortunately, we can see all of them on the official Hadoop Summit 2016 Dublin Youtube channel.

  • SQL streaming: This talk gave a really nice overview of the development of an SQL streaming solution, with all its technical challenges and how they were addressed. Simple technical use cases were also discussed and compared to traditional SQL: there, each query terminates, whereas streaming SQL queries never terminate.

  • Hadoop at LinkedIn: Here we got valuable insights into the Hadoop landscape of LinkedIn, as well as job monitoring and automated health checks. A job monitoring tool, Dr. Elephant, developed by LinkedIn was open sourced only a few days before the start of the Summit.

  • Containerization at Spotify: This talk was about how Spotify uses Docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed that Spotify overcomes security challenges by not implementing internal security measures at all. According to the speaker, everyone can access everyone’s data. If life could always be as simple as that 🙂

  • Apache Zeppelin + Apache Livy: Apache Zeppelin is already a great tool for interactive data analysis and exploration, or even for doing ETL tasks using Apache Pig, querying data using Apache Hive, and executing Python, R or bash scripts. Apache Livy helps data scientists work together in one notebook on a secure cluster. What I like a lot about this talk is that the speakers nicely explain the authentication mechanism involved.

  • Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase and much more. This talk was basically a view on the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!

10 Years of Hadoop Party

On the evening of day one, the Guinness Storehouse hosted a huge burger-beer-and-big-data networking event. As you can imagine, there was good food, Guinness, great music by Irish bands on several floors and, most importantly, the same cool people who attended the conference.

Author in the Guinness storehouse

Summary

My first Hadoop Summit was a great experience in every respect. I made great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully, I will be able to attend the next Hadoop Summit in Munich in 2017.

Hortonworks HDP Admin Certification Preparation

I recently received the Hortonworks Admin Certification. In this blog entry I wrote down the most important steps to get certified, as well as my own experience.

Types of Certification Offered By Hortonworks

At the moment Hortonworks offers three certifications; more will follow as their training offering grows.

  • HDP Developer (Java)
  • HDP Developer (Pig, Hive)
  • HDP Administrator

How to Prepare for the Exam

Here are a few links that helped me prepare for the exam and some additional information.

https://hortonworks.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf

This is the guide to a practice exam with conditions similar to the real one. Follow the instructions and check whether you can finish the tasks. The guide describes the procedure for the developer exam, but the AWS image you should use is the one for the admin exam. When you search for the image, you can find the admin one easily: choose the suitable one in step 3. The practice exam tasks can easily be found on the website.

http://hortonworks.com/training/class/hdp-certified-administrator-hdpca-exam/

This is general information on the HDP certified admin exam and the bare minimum you need to read. At the end of this page you will find several links to documentation. Work through these tasks, and understand what you are doing and why you are doing it. Also, make sure you can find these documentation pages by yourself. During the exam you will have access to the internet, including the HDP documentation. However, certain websites are blocked, such as the official documentation of Apache Hadoop. The best approach is to learn how to navigate the HDP documentation quickly, and the best way to learn that is to actively use HDP and its documentation. (http://docs.hortonworks.com/index.html)

Registration Process

Go to https://www.examslocal.com and search for your preferred exam. Register on any available day. Slots are available Monday to Sunday around the clock and can be booked and cancelled up to 24 hours before you want to start the exam. That’s actually really cool.

My Story – All Good Things Come in Threes

It took me three (!) attempts to become an HDP Certified Administrator. Here is why. This might also be interesting for you if you want to know in detail how the exam procedure works.

Preparation

On my first attempt I was quite nervous. I was ready to go 15 minutes before the exam actually started. I logged into the exam’s online tool and watched the countdown slowly run down. Then, finally, the exam proctor arrived. I connected remotely, started to publish my webcam stream and shared my screen. Then the guy on the other side of the world (it was midnight at his place, he told me) asked me to perform a few steps to ensure that I was not cheating. He asked me to turn my webcam around to make sure that there were no other people close by*. Then he asked me to show him my passport. Before the exam could finally start, he asked me to open my computer’s task manager.

The Exam Starts

First of all I familiarized myself with the environment and started to read the task sheet. It said that there would be five nodes in the HDP cluster and that I could log on to them from the client machine using the password hadoop. I tried to connect to the host called “namenode”. The host was not reachable. “Maybe it’s not up yet and maybe it’s part of the exam to fix it”, I thought. So I tried to reach the host called “hivemaster”, which worked. Upon logging in, I quickly noticed that the “hivemaster” thought it was the “namenode” host. So I tried to reach the node that was supposed to be the namenode via its IP address, but the host could not be reached. That meant the problem was not limited to the client machine they provided. I tried to access the firewall configuration and checked whether I could start the nodes myself (the nodes ran in Docker containers on the client machine), but neither worked, since I did not have superuser permissions.

After only five minutes, I expressed my frustration to the tired proctor (remember: midnight). He felt sorry for me, but couldn’t help me, since he “was not a technical guy” and no “technical guy” was available at that time of day either. I quit the exam, and two hours later I got a voucher for my next “attempt”.

#2

The next attempt, only a few days later, was fairly similar. I got the exact same exam environment. Once again, I could not start the exam and did not even get to look at the exam tasks. This time I expressed my frustration to the person responsible for certification at Hortonworks (certification@hortonworks.com), who then personally made sure that the remote environment would be accessible for my third attempt.

Eventually…

One week later everything worked as it was supposed to work. I finished all of the tasks in less than the given 2 hours and received my HDP Certified Admin badge two days later.

One Final Piece of Advice

Examslocal recommends a certain screen size/resolution. I did the exam on a notebook with less than the recommended values. It worked, but I had to deal with 3 scroll bars on the right-hand side of the screen and 2-3 scroll bars at the bottom of the screen. This can be very confusing and time-consuming.

  • Note: Be prepared for people to see the mess around you if you are doing the exam at home. Everything will be recorded 😉