How to Write a Marker File in a Luigi “PigJobTask”

On 29/08/2016 (12/12/2018), by Condla, in Data Engineering

This is a brief memory aid on how to write marker files when using “Luigi”, which I explained in an earlier blog post.

What is a Marker File?

A marker file is an empty file created with the sole purpose of signaling to another process or application that some process is currently running or has finished. In the context of scheduling with Luigi, a marker file signals to the Luigi scheduler that a certain task of a pipeline has already finished and does not need to run again.

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any sort of (Python) code you desire. You can access the input and output targets of the Task object and use them to write data to the output. The principle is that a Luigi Task will not run again if the file defined in the output target already exists. This can be a LocalTarget (local file), an HdfsTarget (file saved to HDFS) or any other custom target. That’s basically it.
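
The rerun rule can be imitated in a few lines of plain Python. This is only a sketch of the principle, not Luigi’s implementation; run_once and the marker path are hypothetical names chosen for illustration.

```python
import os
import tempfile


def run_once(marker_path, job):
    """Run job() unless marker_path exists; write the marker only on success."""
    if os.path.exists(marker_path):
        return False  # output already there: skip, as Luigi would
    job()  # if this raises, no marker is written and the next call retries
    with open(marker_path, "w") as marker:
        marker.write("SUCCESS")
    return True


calls = []
marker = os.path.join(tempfile.mkdtemp(), "checkpoint")
first = run_once(marker, lambda: calls.append("ran"))
second = run_once(marker, lambda: calls.append("ran"))
print(first, second, calls)
```

The second call returns without running the job, because the marker written by the first call already exists.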

How to Write a Marker File in a PigJobTask

With a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your Pig script. In the Luigi pipeline, you basically define the location of the Pig script that you want to run and optionally a few other parameters depending on your Hadoop cluster configuration, but you don’t need to implement the run method anymore.

The scenario is that you do not have access to the HDFS output directory, e.g. because it’s the Hive warehouse directory or the Solr index directory, or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file, locally or in HDFS, that signals to Luigi that the job has already run successfully. You can specify an arbitrary output file in the output method. This will not create a marker file yet. The trick is to implement the run method yourself, explicitly execute the Pig script in it and do arbitrary stuff, such as creating a marker file, afterwards.

You can see a sample PigJobTask that utilizes this technique below:

import luigi
import luigi.contrib.pig


class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table
    (can be Avro, ORC, ...)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

    def requires(self):
        # DependentTask is a placeholder for the task that has to run first
        return DependentTask()

    def output(self):
        '''
        Defines the marker file that determines whether the task has run.
        Can be LocalTarget or HdfsTarget or ...
        '''
        return luigi.LocalTarget(self.staging_dir + "checkpoint")

    def pig_options(self):
        '''
        These are the pig options you want to start the pig client with
        '''
        return ['-useHCatalog']

    def pig_script_path(self):
        '''
        Path of the Pig script to execute.
        '''
        return self.pig_script

    def pig_parameters(self):
        '''
        Set Pig input parameter strings here.
        '''
        return {
            'INPUT': self.input_directory,
            'HIVE_TABLE': self.hive_table
        }

    def run(self):
        '''
        This is the important part. You basically tell the run method to run the Pig
        script. Afterwards you do what you want to do. Basically you want to write an
        empty output file - or in this case you write "SUCCESS" to the file.
        '''
        luigi.contrib.pig.PigJobTask.run(self)
        # Only reached if the Pig job did not raise, so the marker means success
        with self.output().open('w') as f:
            f.write("SUCCESS")

Book Review: Learning Responsive Data Visualization

On 03/08/2016 (12/12/2018), by Condla, in Reviews

This post describes my experience reading the book “Learning Responsive Data Visualization” by Christoph Körner.

What is it all about?

The book aims to explain the concepts and application of responsive data visualization technologies. It describes Bootstrap, the well-known CSS framework from Twitter, SVG graphics and the JavaScript visualization framework D3.js.

The book has 9 chapters: starting from a short introduction of the components in use, it quickly enables the user to create their first visualization and increases the level of detail and complexity systematically. Later, it describes a combined usage of these components and presents techniques on how to create more elaborate layouts and animations.

In the end, the book motivates and explains how to test visualization applications, as well as outlines how to solve cross-browser issues.

About the Author

From Amazon

Christoph Körner, CTO and lead developer at GESIM, a start-up company, is a passionate software engineer, web enthusiast, and an active member of the JavaScript community with more than 5 years of experience in developing customer-oriented web applications. He is the author of Data Visualizations with D3 and AngularJS and is currently pursuing his master’s degree in Visual Computing at Vienna Institute of Technology.

My opinion

Christoph uses short, concise descriptions. Instead of being verbose, he provides many links to official documentation or interesting blog entries for further reading, using unobtrusive text boxes throughout the chapters. The author understands how to direct the reader’s attention to the important parts of the technologies introduced.

Nevertheless, here and there the author finishes a section with an outlook on an advanced topic that would have deserved a little more attention. An example of this can be found at the end of chapter 2, when the author mentions that “D3 provided more useful methods on the generator functions”. He then names only one such method and describes it in one sentence. A short list of these methods, or yet another of the excellent code examples in the book, would have been more useful.

What I really enjoyed is that the author follows the title of the book closely and visualizes not only the code examples but also graphically depicts the concepts and philosophy of the frameworks in use. This helped me a lot to understand the ideas.

One of the most important qualities of a textbook is to be simple and comprehensible. Christoph easily reaches these goals.

Audience

In my opinion, you need some level of experience with HTML, CSS and JavaScript before you can get started.
Thus, I believe the book aims at developers of intermediate level. On the other hand, if you bring these prerequisites, the book is well suited to beginners in D3.js.

Conclusion

On Amazon I rated the book with 4 stars: while I like that it is completely fact based and content focused, as mentioned above, I kind of miss some historical information or funny side stories in footnotes or fact boxes. Instead, fact boxes are used efficiently to point to additional technical content. There are some rare 5-star books out there that manage to build this fine bridge between being educational and entertaining. “Learning Responsive Data Visualization” does not build this bridge, but it is a solid book for teaching yourself and others modern responsive data visualization.

How to Create a Data Pipeline Using Luigi

On 19/07/2016 (12/12/2018), by Condla, in Data Engineering

This is a simple walk-through of an example usage of Luigi. Online you can find the excellent documentation by Spotify themselves, which contains all the bits and bytes you need to create your own pipeline script. There are also already a few blog posts about what is possible with Luigi, but I believe they do not describe very well how to implement it. So, in my opinion, there is either too much information to just try it out or too little information to actually get started hands-on. I’ll also say a word about security.

Therefore, I publish a full working example of a minimalist pipeline from which you can start and copy and paste everything you need.

These are the questions I try to answer:

What is Luigi and when do I want to use it?
How do I setup the Luigi scheduler?
How do I specify a Luigi pipeline?
How do I schedule a Luigi pipeline?
Can I use Luigi with a secure Hadoop cluster?
What do I like about Luigi?

What is Luigi?

Luigi is a framework written in Python that makes it easy to define and execute complex pipelines in a consistent way. You can use Luigi …

… when your data is processed in (micro) batches rather than streamed.
… when you want to run jobs that depend on (many) other jobs.
… when you want to have nice visualizations of your pipelines to keep a good overview.
… when you want to integrate data into the Hadoop ecosystem.
… when you want to do any of the above and love Python.

Create Infrastructure

Every pipeline can actually be tested using the --local-scheduler flag on the command line. But for production you should use a central scheduler running on one node.

The first thing you want to do is to create a user and a group that the scheduler will run as.

sudo groupadd luigi
sudo useradd -g luigi luigi

The second step is to create a Luigi config directory.

sudo mkdir /etc/luigi
sudo chown luigi:luigi /etc/luigi

You also need to install Luigi (and Python and pip) if you did not do that already.

pip install luigi

It’s now time to deploy the configuration file. Put the following file into /etc/luigi/luigi.cfg. In this example the Apache Pig home directory of a Hortonworks Hadoop cluster is specified. There are many more configuration options listed in the official documentation.

[core]
default-scheduler-host=www.example.com
default-scheduler-port=8088

[pig]
home=/usr/hdp/current/pig-client
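
Since the file uses the INI format, a quick sanity check with Python’s configparser can catch typos before the scheduler is started. This is a minimal sketch that parses the example content from a string; it only shows that the file is well formed and does not claim to reproduce how Luigi itself loads its configuration.

```python
import configparser

# Same content as the example /etc/luigi/luigi.cfg above.
CONFIG_TEXT = """
[core]
default-scheduler-host=www.example.com
default-scheduler-port=8088

[pig]
home=/usr/hdp/current/pig-client
"""

parser = configparser.ConfigParser()
# For the real file, use parser.read("/etc/luigi/luigi.cfg") instead.
parser.read_string(CONFIG_TEXT)

host = parser.get("core", "default-scheduler-host")
port = parser.getint("core", "default-scheduler-port")
pig_home = parser.get("pig", "home")
print(host, port, pig_home)
```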

Don’t forget to create directories for the PID file of the Luigi scheduler daemon, its logs and its state file.

sudo mkdir /var/run/luigi
sudo mkdir /var/log/luigi
sudo mkdir /var/lib/luigi
sudo chown luigi:luigi /var/run/luigi
sudo chown luigi:luigi /var/log/luigi
sudo chown luigi:luigi /var/lib/luigi

You are now prepared to start up the scheduler daemon.

sudo su - luigi
luigid --background --port 8088 --address www.example.com --pidfile /var/run/luigi/luigi.pid --logdir /var/log/luigi --state-path /var/lib/luigi/luigi.state

A Simple Pipeline

We are now ready to go. Let’s specify an example pipeline that can actually be run without a Hadoop ecosystem present: it reads data from a custom file, counts the number of words and writes the output to a file called count.txt. In this example two of the most basic task types are used: luigi.ExternalTask, which requires you to implement the output method, and luigi.Task, which requires you to implement the requires, output and run methods. I added docstrings to all methods and class definitions, so the code below should speak for itself. You can also view it on GitHub.

import luigi


class FileInput(luigi.ExternalTask):
    '''
    Define the input file for our job:
    The output method of this class defines
    the input file of the class in which FileInput is
    referenced in "requires"
    '''

    # Parameter definition: input file path
    input_path = luigi.Parameter()

    def output(self):
        '''
        As stated: the output method defines a path.
        If the FileInput class is referenced in a
        "requires" method of another task class, the
        file can be used with the "input" method in that
        class.
        '''
        return luigi.LocalTarget(self.input_path)


class CountIt(luigi.Task):
    '''
    Counts the words from the input file and saves the
    output into another file.
    '''

    input_path = luigi.Parameter()

    def requires(self):
        '''
        Requires the output of the previously defined class.
        Can be used as input in this class.
        '''
        return FileInput(self.input_path)

    def output(self):
        '''
        count.txt is the output file of the job. In a more
        close-to-reality job you would specify a parameter for
        this instead of hardcoding it.
        '''
        return luigi.LocalTarget('count.txt')

    def run(self):
        '''
        This method opens the input file stream, counts the
        words, opens the output file stream and writes the number.
        '''
        word_count = 0
        with self.input().open('r') as ifp:
            for line in ifp:
                # split() without arguments ignores repeated spaces and the newline
                word_count += len(line.split())
        with self.output().open('w') as ofp:
            # str() replaces unicode(), which only exists in Python 2
            ofp.write(str(word_count))


if __name__ == "__main__":
    luigi.run(main_task_cls=CountIt)
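
The counting step inside run can be tried out without Luigi. The sketch below uses an in-memory file as a stand-in for the input target (an assumption for illustration) and also shows why splitting on a single space over-counts: repeated spaces and empty lines each produce extra pieces.

```python
import io

TEXT = "hello  luigi\n\nthree more words\n"


def count_words(lines):
    """Count whitespace-separated words over an iterable of lines."""
    return sum(len(line.split()) for line in lines)


# split() treats any run of whitespace as one separator.
total = count_words(io.StringIO(TEXT))

# split(' ') keeps empty strings and counts the bare newline line as a word.
naive = sum(len(line.split(" ")) for line in io.StringIO(TEXT))

print(total, naive)
```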

Schedule the Pipeline

To test and schedule your pipeline create a file test.txt with arbitrary content.
We can now execute the pipeline manually by typing

python pipe.py --input-path test.txt

Use the following if you didn’t set up and configure the central scheduler as described above:

python pipe.py --input-path test.txt --local-scheduler

If you did everything right, you will see that no tasks failed and that a file count.txt was created containing the number of words in your input file.

Try running this job again. You will notice that Luigi will tell you that there already is a dependency present. Luigi detects that the count.txt is already written and will not run the job again.

Now you can easily trigger this pipeline on a daily basis, e.g., by using crontab to start the job every minute. If your input and output files have the current date as a suffix in their file names, the job will be triggered every minute, but will run successfully exactly once a day.

In a crontab you could do the following:

* * * * * python pipe.py --input-path test.txt
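
The date suffix mentioned above could be built as follows. This is a minimal sketch with a hypothetical helper, dated_name; the prefix, separator and extension are assumptions and not part of the pipeline shown earlier.

```python
import datetime


def dated_name(prefix, day, extension=".txt"):
    """Build a file name such as count_2016-07-19.txt for the given day."""
    return "{}_{}{}".format(prefix, day.isoformat(), extension)


today = datetime.date.today()
print(dated_name("test", today))   # input file expected for today
print(dated_name("count", today))  # output file that marks today's run as done

example = dated_name("count", datetime.date(2016, 7, 19))
print(example)
```

Because the output name changes at midnight, the first trigger of a new day finds no output file and runs the job, while all later triggers of that day are skipped.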

Security

The cool thing about Luigi is that you basically don’t need to worry much about security. Luigi simply uses the security features of the components it interacts with. If you are, e.g., working on a secure Hadoop cluster (that is, a cluster where Kerberos authentication is enforced), the only thing you need to worry about is obtaining a fresh Kerberos ticket before you trigger the job, given that the ticket stays valid longer than the job needs to finish. I.e., when you schedule your pipeline with cron, make sure you do a kinit from a keytab. You can check out my answer to a related question on the Hortonworks community connection for more details on that (https://community.hortonworks.com/questions/5488/what-are-the-required-steps-we-need-to-follow-in-s.html#answer-5490).
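
If the job is started from a Python wrapper instead of directly from cron, the kinit step could be sketched like this. The keytab path and the principal are hypothetical placeholders, and run_with_ticket is an illustrative helper, not part of Luigi.

```python
import subprocess


def kinit_command(keytab, principal):
    """Return the kinit call that obtains a ticket from a keytab."""
    return ["kinit", "-kt", keytab, principal]


def run_with_ticket(keytab, principal, pipeline_command):
    """Obtain a Kerberos ticket, then start the pipeline; raise if either step fails."""
    subprocess.check_call(kinit_command(keytab, principal))
    subprocess.check_call(pipeline_command)


# Hypothetical keytab path and principal; replace them with your own.
command = kinit_command("/etc/security/keytabs/luigi.keytab", "luigi@EXAMPLE.COM")
print(" ".join(command))

# On a Kerberized host you would then call, for example:
# run_with_ticket("/etc/security/keytabs/luigi.keytab", "luigi@EXAMPLE.COM",
#                 ["python", "pipe.py", "--input-path", "test.txt"])
```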

What do I like about Luigi?

It combines my favourite programming language and my favourite distributed ecosystem. I didn’t go into that much here, but Luigi is especially great because of its rich ways to interact with Hadoop ecosystem services. Instead of a LocalTarget you would rather use HdfsTargets or Amazon S3Targets. You can define and run Pig jobs, and there even is an Apache Hive client built in.

How to Write a Command Line Tool in Python

On 15/05/2016 (28/12/2021), by Condla, in Data Engineering

Scope and Prerequisites

This rather long blog entry basically consists of two parts:

In the first part, “Motivation”, we will learn a few reasons why to wrap a command line tool (in Python) around an existing REST interface.
If you are not interested in that, but want to know how to build a command line tool skip to the second part – “Ingredients“, “Project Structure” and “Installation“.
There, we will learn what we need to create a most basic and simple command line tool, that will enable us to query the publicly available Pokéapi which is a RESTful Pokémon API. We will name the tool “pokepy”. It will retrieve the name of a Pokémon from the Pokéapi based on the Pokémon’s number. From there you can go ahead and write a more complex and extensive command line tool yourself with your own custom logic and your own data source API.

The interface will be as easy as calling

pokepy pokemon --id=1

This educational tool is available on https://github.com/condla/pokepy. You only need basic Python knowledge to follow along.
The Pokeapi REST service: https://pokeapi.co/

Motivation
Writing a command line tool can be very handy for various reasons – not only to easily obtain Pokémon information. Imagine you have a data source available as RESTful API, such as the Pokéapi. If you wanted to use an API like this just to look up information occasionally, you could put an often quite long query into your browser, fill in the parameters and press enter. The result would show in your browser. Often a REST API exposes more information than you actually need in your daily life and you would need to use your browser search function to get to the data point you need.
You could also use a command line tool like “curl” to query the API, which brings in another advantage. You can now send these requests within a bash script.
curl https://pokeapi.co/api/v2/pokemon/150/

For simple queries like this you could then parameterize the URL by setting it in an environment variable. This is easier to remember as well as easier and faster to type.
export POKEURL=https://pokeapi.co/api/v2/pokemon;
curl $POKEURL/150/

Now, why do we want to wrap something like this into a Python command line tool, when the above command already looks so easy? There are several reasons:

We are only doing GET requests for now. Other APIs allow you to do all sorts of REST calls (PUT, POST, DELETE), which makes it complex to parameterize using environment variables. Wrapping it into Python logic makes the API once again more accessible and user friendly.
Also, if you have a look at the Pokéapi you notice that you can not only query for Pokémon, but also for types and abilities. This introduces another level of complexity in building the URL string with environment variables (https://pokeapi.co/api/v2/pokemon/, https://pokeapi.co/api/v2/type/, https://pokeapi.co/api/v2/ability/). This task can be tackled more elegantly in Python.
Additionally, a (Python) command line tool proves really useful when you want to do REST calls against an API that changes the state of the underlying system.

This is easier to write, read, configure and memorize:
example-tool put state=up

than this:
curl -H 'Content-Type: application/json' -X PUT -d '{"state":"up"}' http://example.com/api/v2/service/


You can put a lot of custom logic into the command line tool to transform data, merge data from two or more different APIs, make calculations and customize the output to be either human or machine readable, or both.
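
How a tool like the hypothetical example-tool could translate “put state=up” into the PUT request shown above can be sketched with the standard library alone. build_put_request is an illustrative name, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_put_request(url, assignments):
    """Turn arguments like ['state=up'] into a JSON PUT request (not sent here)."""
    payload = dict(item.split("=", 1) for item in assignments)
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_put_request("http://example.com/api/v2/service/", ["state=up"])
print(request.get_method(), request.full_url, request.data)
# urllib.request.urlopen(request) would actually send it.
```

The user only types key=value pairs; quoting, headers and the HTTP method are handled in one place.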

Ingredients
We will be using the docopt module as a command line argument parser, as well as requests to send the request to the Pokéapi. We will also need to have python-pip installed. Python pip can be installed easily via your favourite package manager. On Ubuntu you would do:
sudo apt-get install python-pip

There are many other libraries out there to parse command line arguments or send HTTP requests. This should merely serve as an example.
Project Structure
The minimum requirements on the project structure are the following.
pokepy/
├── pokepy
│   └── __init__.py
└── setup.py
In my GitHub repo you see a few more files, which are necessary to put the module into the Python Package Index. More on packaging a module can be found here.
The following sections explain and describe the essence of these files:
pokepy/__init__.py
This file serves three purposes: it is the entry point of our command line tool, it is the file required to mark the directory as a Python module, and it contains all of our logic. Usually we would separate these three things, but for simplicity we keep everything in one file. Below you can see the code:

Since we are using docopt, the module doc string at the top of the file completely defines the usage of the command line interface. If an end user does not follow the rules defined in this doc string, the usage message will be printed to the screen.
The entry point of the script is the __main__ function at the bottom of the file.
Its first statement imports the docopt module.
If end users follow the rules defined in the doc string, the command line arguments are parsed by the docopt(__doc__, ...) call.
The next two statements read out the parsed command line arguments by calling the two functions get_api_path and get_id.
Finally, the actual logic of the tool, “call_pokeapi(path, id_number)”, is called.
call_pokeapi(path, id_number) builds the URL and utilizes the requests module to do the REST call. If the default key “name” exists in the JSON response, its value is returned. If the default key “name” does not exist, the assumption here is that we are out of range of existing Pokémon and therefore receive an error message response. This response has only one key: “detail”. In this case we print out the value of “detail” (which is expected to be “Not found.” 🙂 )

'''
Usage:
    pokepy (pokemon | type | ability) --id=ID

Options:
    -i --id=ID  Specify the id of the pokemon, type or ability
    -h --help   Show this help
'''

import requests

POKEAPI = 'https://pokeapi.co/api/v2/{path}/{id}'

def get_api_path(arguments):
    '''
    Get pokemon or type or ability command from command line
    arguments.
    '''
    paths = ['pokemon', 'type', 'ability']
    for path in paths:
        if arguments[path]:
            break
    return path

def get_id(arguments):
    '''
    Get id from command line arguments.
    '''
    return arguments['--id']

def call_pokeapi(path, id_number, key='name'):
    '''
    Call the RESTful PokeAPI and parse the response. If pokemon, ability or
    type ids are not found then the error message detail is returned.
    '''
    url = POKEAPI.format(path=path, id=id_number)
    response = requests.get(url)
    response_json = response.json()
    try:
        res = response_json[key]
    except KeyError:
        # The key is missing, so fall back to the error message of the API
        res = response_json['detail']
    return res

def __main__():
    '''
    Entrypoint of command line interface.
    '''
    from docopt import docopt
    arguments = docopt(__doc__, version='0.1.0')
    path = get_api_path(arguments)
    id_number = get_id(arguments)
    print(call_pokeapi(path, id_number))
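
The argument and response handling can be tried without network access. In the sketch below the dictionaries are hand-written stand-ins for what docopt and the API would return (assumptions, not captured output), and pick_path and pick_value are illustrative variants of get_api_path and of the key lookup in call_pokeapi.

```python
def pick_path(arguments, paths=("pokemon", "type", "ability")):
    """Return the first command that is marked as True in the arguments."""
    for path in paths:
        if arguments[path]:
            return path
    raise ValueError("no command given")


def pick_value(response_json, key="name"):
    """Return response_json[key], or the error detail if the key is missing."""
    return response_json.get(key, response_json.get("detail"))


# Hand-written stand-ins for the parsed arguments and two API responses.
arguments = {"pokemon": True, "type": False, "ability": False, "--id": "25"}
found = {"name": "pikachu", "id": 25}
missing = {"detail": "Not found."}

print(pick_path(arguments), pick_value(found), pick_value(missing))
```

Unlike the loop in get_api_path, pick_path raises an error when no command is set instead of silently returning the last entry of the list.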

setup.py
Now we need to tell Python that we want to use our module as a command line tool after installing it. Have a look at the code below:

The first lines (the doc string, the imports and the here variable) are basically boilerplate and don’t do much.
Then the setup method is called with a lot of partly self-explanatory and partly boring parameters. What we really need here are the following two parameters:

install_requires, where we specify a list of dependencies that will be installed by pip if the requirements are not already satisfied.
entry_points, where we specify an entry point “console_scripts” in a dictionary. The value pokepy=pokepy:__main__ means that when we call “pokepy” from the command line, the __main__ function of the pokepy module will be called.



'''
pokepy setup module
'''

from setuptools import setup, find_packages
from codecs import open
from os import path

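here = path.abspath(path.dirname(__file__))

# Assumption: a README.rst lives next to setup.py; adjust the name to your project.
with open(path.join(here, 'README.rst'), encoding='utf-8') as f:
    long_description = f.read()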

setup(
    name='pokepy',
    version='0.1.0',
    description='A Pokeapi wrapper command line tool',
    long_description=long_description,
    url='https://github.com/condla/pokepy',
    author='Stefan Kupstaitis-Dunkler',
    author_email='email.address@gmail.com',
    license='Apache 2.0',
    classifiers=[
        'Development Status :: 3 - Alpha'
    ],

    keywords='Pokeapi REST client wrapper command line interface',
    packages=find_packages(),
    install_requires=['docopt', 'requests'],
    extras_require={},
    package_data={},

    entry_points={
        'console_scripts': [
            'pokepy=pokepy:__main__',
        ],
    },
)
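
What the string pokepy=pokepy:__main__ expresses can be illustrated with a few lines of Python. This is a simplified sketch of the idea, not how pip or setuptools generate the actual console script; json:dumps stands in for pokepy:__main__ so that the example runs without pokepy being installed, and resolve_entry_point is a hypothetical helper.

```python
import importlib


def resolve_entry_point(spec):
    """Split 'command=module:function' and import the function it names."""
    command, target = spec.split("=", 1)
    module_name, function_name = target.split(":", 1)
    module = importlib.import_module(module_name.strip())
    return command.strip(), getattr(module, function_name.strip())


# "tojson" is the command name, "json" the module and "dumps" the function.
command, function = resolve_entry_point("tojson=json:dumps")
print(command, function([1, 2]))
```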

Installation
The only thing that’s left is to install our tool and put it to use. I recommend doing this in its own virtual environment, but it is not mandatory. In the project directory do:
# this will create a new virtual python environment in the env directory
virtualenv env
# this will activate the environment (now you can install anything into it without affecting the rest of the system)
source env/bin/activate
# install the pokepy module into your virtual environment
pip install -e .

Congratulations, you can now go ahead and use your command line tool, for example like this (the $ symbol represents the command prompt):
$ pokepy pokemon -i 25
pikachu

Conclusion
We saw why it is useful to wrap an API into a command line interface and how it is done in Python. Now you know everything you need to go ahead and create more useful tools with more complex logic by extending this module to fit your needs.

Maker Faire Vienna 2016

On 28/04/2016 (12/12/2018), by Condla, in Reviews

I always wanted to attend a Maker Faire whenever I heard about it. A fair for people who build and create things with their own minds and hands; a fair for children to show them how accessible technology is and how easy it is to get started to build their own things; a fair for those who do instead of just keep talking. However, until now I never had the time or was just too far away to attend. Last weekend, April 16 and 17, the Maker Faire came to Vienna, which meant that I finally had this opportunity.

Image: The One Love Machine Band

Makers

There is not enough space here to list all the great things that were shown at the Maker Faire Vienna 2016, but we saw a machine that automatically baked typical Austrian pancakes (“Palatschinken”) and another one that mixed the typical Austrian “Spritzer”, a mixture of sparkling water and wine. A technical college showcased its pupils’ cool projects. The racing team and the space team of the Vienna University of Technology were there as well. Vienna’s hacker spaces had their own booths, and many more people demonstrated their skills. It is a pity that it is not feasible to name them all. High rooms, old wooden floors, big wooden beams across the room and huge pillars lent the event a special atmosphere.

Talks

Since there were so many things to see and I had to prepare my own talk “Smart Home – from Maker to Market”, I can’t really say anything about any of the talks there. They are all available on Vimeo, so I guess it’s worth watching them if you are interested and couldn’t attend the sessions.

I gave a talk that was not so much about a single smart home, but about what would happen if we connected many smart homes: which benefits we could gain, the challenges involved and the high-level architecture. I would have liked to talk about all of these things in more detail, but I was limited to 30 minutes. Watch the video [German] and if you have questions, don’t hesitate to contact me.

Summary

I am excited to see this event outgrowing its current location and hopefully the Maker Faire Vienna 2017 comes back with even more makers, more cool projects and more great people.

Meet the Hadoop User Group Vienna

On 23/04/2016 (12/12/2018), by Condla, in Reviews

A friend said, “Vienna needs a Hadoop User Group” and I agreed with him. The next step was to create a Meetup group. Meetup is a platform where everyone can organize any kind of meeting on any kind of topic. Hadoop has recently started to gain a little traction in Austria and Vienna, and I think it’s the perfect time to start a group like this.

This group is for everyone located in Vienna who uses Apache Hadoop, at any level of skill. The focus of the group is clearly technical, with an eye on use cases. I try to organize technical talks by Hadoop-related vendors for the sessions. Also, I want to establish the opportunity to work together on real-world problems and get hands-on with Hadoop. In this group we will create a network of Hadoop users, discuss recent and interesting (technical) topics, eat, drink and, most importantly, have fun together.

I’d like the group to be interactive and that everyone has the opportunity to contribute.

For the first Meetup on Wednesday, May 18, I plan to briefly introduce the goals of the group. I believe all members of the group should brainstorm together about what all of us expect of the group in the future and try to figure out how often we should meet and which topics we want to work on.

My ideas on what it could look like in the future:

One of us could provide some code and walk the others through it. That way the more experienced among us can provide feedback and give hints on what to improve, and the less experienced gain knowledge.
We can define a project to work on together: e.g., building a Hadoop cluster out of Raspberry Pis, writing streaming applications in Apache Storm or Apache Spark, or whatever you want.
I plan to combine the Meetup every now and then with the Vienna Kaggle Meetup and do a session about “Data Science and Hadoop”.
Similarly to the Vienna Kaggle group, I created a git organisation for code that we work on together. If you are interested in joining, just contact me and I will give you access.

I am looking forward to getting to know you as well as hearing your ideas on what to contribute to the group.

My Impressions of the Hadoop Summit Dublin 2016

On 23/04/2016 (12/12/2018), by Condla, in Reviews

The Hadoop Summit is a tech conference hosted by Hortonworks, one of the biggest Apache Hadoop distributors, and Yahoo, the company in which Hadoop was born. Software developers, consultants, business owners and administrators who share an interest in Hadoop and the technologies of its ecosystem all gathered in Dublin, where this year’s European Hadoop Summit took place. The Hadoop Summit 2016 Dublin had some great keynotes, plenty of time to network and a lot of exciting talks about bleeding edge technology, its use cases and success stories. It was also a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.

Image: Keynote “Data is Beautiful”

The organisation of the conference was great. 1300 people participated, but it never felt crowded, nor were there any (big) waiting lines to enter the speaker rooms or at the lunch buffet.

My Favorite Talks

This is a list of my favorite talks in chronological order with their videos embedded. To be honest, this list contains almost all of the talks that I saw in person, and I probably missed even more great talks that were given in parallel. Fortunately, we can see all of them on the official Hadoop Summit 2016 Dublin YouTube channel.

SQL streaming: This talk gave a really nice overview of the development of an SQL streaming solution with all its technical challenges and how they were addressed. Also simple technical use cases were discussed and compared to traditional SQL, where each query terminates, whereas streaming SQL queries never terminate.

Hadoop at LinkedIn: Here we got valuable insights into the Hadoop landscape of LinkedIn, as well as job monitoring and automated health checks. A job monitoring tool, Dr. Elephant, developed by LinkedIn was open sourced only a few days before the start of the Summit.

Containerization at Spotify: This talk was about how Spotify uses Docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed that Spotify overcomes security challenges by not implementing internal security measures at all. According to the speaker, everyone can access everyone’s data. If life could always be as simple as that 🙂

Apache Zeppelin + Apache Livy: Apache Zeppelin already is a great tool for interactive data analysis, exploration or even doing ETL tasks using Apache Pig, querying data using Apache Hive, as well as executing Python, R or bash scripts. Apache Livy helps data scientists work together in one notebook on a secure cluster. What I like a lot about this talk is, that the speakers nicely explain the authentication mechanism involved.

Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase and much more. This talk was basically a view on the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!

10 Years of Hadoop Party

On the evening of day one, the Guinness Storehouse was turned into a huge burger-beer-and-big-data networking event. As you can imagine, there was good food, Guinness, great music by Irish bands on several floors and, most importantly, the same cool people who attended the conference.

Summary

My first Hadoop Summit was a great experience in every respect. I made great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully, I will be able to attend the next Hadoop Summit, in Munich in 2017.

Which Command Line Tool Does Not Exist?

On 03/04/2016 / 12/12/2018, by Condla, in Data Engineering

Answer the Question!

Setting Up Apache Nifi on a Raspberry Pi

On 03/04/2016 / 12/12/2018, by Condla, in Data Engineering

Apache NiFi is part of the Hortonworks DataFlow (HDF) product and manages data flows. The Raspberry Pi is a small, open source, multi-purpose computer. If you are not familiar with one or both of these products, just follow the links for more information. 🙂

Hardware and Software Specifications

Hardware: Raspberry Pi 2.
Operating System: Raspbian version March-2016 (Download).
Bootstrapping the RasPi: using my prepared Ansible script. Check out the GitHub project “Boostrap Raspbian with Ansible” and the corresponding article “How to Setup the Raspberry Pi 3 Using Ansible” for more information.
Software: HDF 1.2 (Download).

Setup

Download and unzip HDF. I put it into the home directory of the RasPi:
```
pi@raspberrypi:~/HDF-1.2.0.0/
```

Install NiFi:

pi@raspberrypi:~/HDF-1.2.0.0/nifi/bin $ sudo ./nifi.sh install

Start NiFi:
```
/etc/init.d/nifi start
```
For details check the official docs:
- https://nifi.apache.org/docs.html
- http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/index.html

Impressions and Remarks

The docs say that after installation the command
```
service nifi start
```
should work out of the box, but for me only this works without further modifications:
```
/etc/init.d/nifi start
```
After starting NiFi, I tried to access the web interface, but it didn’t work. I checked the logs, but everything seemed all right. I saw something like the following in nifi-bootstrap.log:
```
2016-04-02 21:06:29,563 INFO [NiFi Bootstrap Command Listener] org.apache.nifi.bootstrap.RunNiFi Apache NiFi now running and listening for Bootstrap requests on port 47094
```
The web interface only became available after 6 minutes and 3 seconds, though. As you can see in the screenshot below, HDF takes 100% of one core of the RasPi during the start-up process:

screenshot_top_nifi: The HDF start-up process occupies one full core of the RasPi
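Since the logs give no hint of when the web interface is actually ready, one way to measure a start-up delay like the six minutes above is to poll the UI until it answers. The following is only a minimal sketch: the function names `http_probe` and `wait_until_up` are made up for this post, and the URL in the usage comment assumes that NiFi serves its UI on port 8080 under `/nifi`, so check `nifi.properties` on your installation.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True as soon as the server sends any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if only with an error status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: not up yet.
        return False


def wait_until_up(url, probe=http_probe, interval=5.0, max_wait=900.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `url` until the probe succeeds.

    Returns the elapsed seconds, or None if `max_wait` seconds pass first.
    `probe`, `clock` and `sleep` are injectable so the loop can be tested
    without a network or real waiting.
    """
    start = clock()
    while True:
        if probe(url):
            return clock() - start
        if clock() - start >= max_wait:
            return None
        sleep(interval)


# Usage (assumed URL, adjust host and port to your setup):
#   elapsed = wait_until_up("http://localhost:8080/nifi")
#   print("not reachable" if elapsed is None else f"up after {elapsed:.0f} s")
```

Treating an HTTP error status as "up" is a deliberate choice here: the question is only whether the web server has started listening, not whether the page renders correctly.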

After the web server is up and running, NiFi’s resource usage looks more moderate:

screenshot_top_nifi_running: NiFi needs about 16.7% CPU (out of the 400% that the four cores provide in total) and about 40.5% of the RasPi’s RAM
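The CPU figures reported by top are relative to a single core, which is why the scale on the quad-core Raspberry Pi 2 goes up to 400%. As a quick sanity check, the conversion to a share of the whole CPU can be sketched like this (the helper name `share_of_total_cpu` is made up for illustration):

```python
def share_of_total_cpu(top_percent, cores=4):
    """Convert a per-core CPU percentage from top into a fraction of all cores."""
    return top_percent / (cores * 100.0)


# Numbers from the two screenshots: running vs. start-up.
for label, value in [("running", 16.7), ("start-up", 100.0)]:
    print(f"{label}: {share_of_total_cpu(value):.1%} of total CPU capacity")
```

So the running NiFi instance uses roughly 4% of the board's total CPU capacity, while the start-up phase blocks a quarter of it.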

I followed the “Getting Started” guide, in which NiFi is configured with two processors. The first one reads files from disk, sends them to the second processor and deletes them from disk. The second one just receives the files and logs their information to nifi-app.log. Although the name of this processor, “LogAttribute”, is quite self-explanatory, the official documentation does not provide a description of what it actually does. I found an amazing blog post on www.nifi.rocks, where quite a lot of processors are described.

test_data: A file is written and then deleted by the NiFi GetFile processor, 100000 times, then …

hdf_ui: … it is transferred to the LogAttribute processor, and finally …

… the LogAttribute processor logs the incoming FlowFile data in nifi-app.log.
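To feed a GetFile processor with test data as described above, a small script that drops files into the watched directory is enough. This is only a sketch and not the loop from the screenshot: the function name `write_test_files`, the file naming and the directory in the usage comment are made up, and writing to a hidden temporary file first assumes that GetFile is left at its default of ignoring hidden files.

```python
import os
from pathlib import Path


def write_test_files(input_dir, count, prefix="test"):
    """Create `count` small text files in `input_dir` and return their paths."""
    target = Path(input_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        final_path = target / f"{prefix}_{i:06d}.txt"
        # Write to a dot-file first and rename afterwards, so a watcher
        # never picks up a half-written file (assumption: hidden files
        # are ignored by the consumer).
        tmp_path = target / f".{final_path.name}.tmp"
        tmp_path.write_text(f"test record {i}\n", encoding="utf-8")
        os.replace(tmp_path, final_path)
        written.append(final_path)
    return written


# Usage (hypothetical directory, use the one configured in GetFile):
#   write_test_files("/home/pi/nifi_input", 100000)
```

Because GetFile deletes each file after reading it, the directory should be empty again once NiFi has caught up, which makes it easy to see whether the flow is running.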

Conclusion

NiFi is as easy to install on a Raspberry Pi as anywhere else and stands out with all of its features, being complex but not complicated. I did not test a lot of different processors on the RasPi, nor did I test this simple setup with large amounts of data, but even in its simplicity the possibilities are endless. By combining the power and ease of use of the RasPi’s GPIOs with NiFi’s power and simplicity in directing and redirecting data flows, practically every child can, for example, send temperature sensor data into a Hadoop file system and even process and filter it on its way.

What is the Expected Amount of Data Produced by the Large Synoptic Survey Telescope per Day?

On 25/03/2016 / 12/12/2018, by Condla, in Uncategorized

Answer the Question!

First guess, then get more information:

on the project’s website: http://www.lsst.org/
on Wikipedia: Large Synoptic Survey Telescope