How to Write a Marker File in a Luigi “PigJobTask”

This is supposed to be a brief aid to memory on how to write marker files, when using “Luigi“, which I explained in a former blog post.

What is a Marker File?

A marker file is an empty file created with the sole purpose of signalizing to another process or application that some process is currently ongoing or finished. In the context of scheduling using Luigi, a marker file signalizes the Luigi scheduler that a certain task of a pipeline has already been finished and does not need to (re-)run anymore.

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any sort of (Python) code you desire. You can access the input and output streams of the Task object and use it to write data to the output stream. The principle is that a Luigi Task will not run again, if the file with the filename defined in the output target already exists. This can be either a LocalTarget (local file) or an HDFSTarget (file saved to HDFS) or any other custom target. That’s basically it.

How to Write a Marker File in a PigJobTask

Using a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your pig script. In the Luigi pipeline, you basically define the pig script location that you want to run and optionally a few other parameters depending on your Hadoop cluster configuration, but you don’t need to implement the run method anymore.

The scenario is that you do not have access to the HDFS output directory, e.g. because its the Hive warehouse directory or the Solr index directory,… or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file locally or in HDFS that signalizes Luigi that the job already has successfully run. You can specify an arbitrary output file in the output method. This will not create a marker file yet. The trick is to implement the run method specify explicitly to execute the pig script and do arbitrary stuff, such as creating a marker file, afterwards in the method.

You can see a sample PigJobTask that utilizes this technique below

class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table 
    (can be Avro, ORC,....)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

def requires(self):
    return DependentTask() # requirement

def output(self):
    '''
    Here the output file that determines if a task was run is written.
    Can be LocalTarget or HDFSTarget or ...
    '''
    return luigi.LocalTarget(self.staging_dir + "checkpoint")

def pig_options(self):
    '''
    These are the pig options you want to start the pig client with
    '''
    return ['-useHCatalog']

def pig_script_path(self):
    '''
    Execute pig script.
    '''
    return self.pig_script

def pig_parameters(self):
    '''
    Set Pig input parameter strings here.
    '''
    return {
        'INPUT': self.input_directory,
        'HIVE_TABLE': self.hive_table
    }

def run(self):
    '''
    This is the important part. You basically tell the run method to run the Pig
    script. Afterwards you do what you want to do. Basically you want to write an
    empty output file - or in this case you write "SUCCESS" to the file.
    '''
    luigi.contrib.pig.PigJobTask.run(self)
    with self.output().open('w') as f:
    f.write("SUCCESS")

How to Write a Command Line Tool in Python

Scope and Prerequisites

This rather long blog entry basically consists of two parts:

  • In the first part “Motivation” we will learn a few reasons on why to wrap a command line tool (in Python) around an existing REST interface.
  • If you are not interested in that, but want to know how to build a command line tool skip to the second part – “Ingredients“, “Project Structure” and “Installation“.
  • There, we will learn what we need to create a most basic and simple command line tool, that will enable us to query the publicly available Pokéapi which is a RESTful Pokémon API. We will name the tool “pokepy”. It will retrieve the name of a Pokémon from the Pokéapi based on the Pokémon’s number. From there you can go ahead and write a more complex and extensive command line tool yourself with your own custom logic and your own data source API.

The interface will be as easy as calling

pokepy pokemon id=1

This educational tool is available on https://github.com/condla/pokepy. You only need basic Python knowledge to follow along.

Capture2
The Pokeapi REST service: https://pokeapi.co/

Motivation

Writing a command line tool can be very handy for various reasons – not only to easily obtain Pokémon information. Imagine you have a data source available as RESTful API, such as the Pokéapi. If you wanted to use an API like this just to look up information occasionally, you could put an often quite long query into your browser, fill in the parameters and press enter. The result would show in your browser. Often a REST API exposes more information than you actually need in your daily life and you would need to use your browser search function to get to the data point you need.

You could also use a command line tool like “curl” to query the API, which brings in another advantage. You can now send these requests within a bash script.

curl https://pokeapi.co/api/v2/pokemon/150/

For simple queries like this you could then parameterize the URL by setting it in an environment variable. This is easier to remember as well as easier and faster to type.

export POKEURL=https://pokeapi.co/api/v2/pokemon;
curl $POKEURL/150/

Now, why do we want to wrap something like this into a python command line tool, when the above command already looks so easy? There are several reasons:

  • We are only doing GET requests for now. Other APIs allow you to do all sorts of REST calls (PUT, POST, DELETE), which makes it complex to parameterize using environment variables. Wrapping it into Python logic makes the API once again more accessible and user friendly.
  • Also, if you have a look at the Pokéapi you notice that you can not only query for Pokémon, but also for types and abilities. This introduces another level of complexity in building the URL string with environment variables (https://pokeapi.co/api/v2/pokemon/, https://pokeapi.co/api/v2/type/, https://pokeapi.co/api/v2/ability/). This task can be tackled more elegantly in Python.
  • Additionally, a (Python) command line tool proves really useful, when you want to do REST calls against an API, that changes the state of the underlying system.
    This is easier to write, read, configure and memorize:

    example-tool put state=up
    

    than this:

    curl -H 'Content-Type: application/json' -X PUT -d '{state:up}'http://example.com/api/v2/service/
    
  • You can put a lot of custom logic into the command line tool to transform data, merge data from two or more different APIs, make calculations and customize the output to be either human or machine readable, or both.

Ingredients

We will be using the docopt module as a command line argument parser, as well as requests to send the request to the Pokéapi. We will also need to have python-pip installed. Python pip can be installed easily via your favourite package manager. On Ubuntu you would do:

sudo apt-get install python-pip

There are many other libraries out there to parse command line arguments or send HTTP requests. This should merely serve as an example.

Project Structure

The minimum requirements on the project structure are the following.

pokepy/
├── pokepy
│  ├── __init__.py
└── setup.py 

In my github repo you see a few more files, which are necessary to put the module into the Python Package Index. More on packaging a module can be found here.

The following sections explain and describe the essence of these files:

pokepy/__init__.py

This file serves as the entry point of our command line tool, it is also the required file to specify that this is actually a module and it contains all of our logic. Usually, we would separate these three things, but for simplicity we just keep it in one file. Below you can see the code:

  • Since we are using docopt, lines 1 to 8 completely define the usage of the command line interface. If an end user does not follow the rules defined in this doc string interface, the usage doc string will be printed to the screen.
  • The entry point of the script is on line 51.
  • On line 55 we import the docopt module.
  • If end users follow the rules defined in the doc string, the command line arguments will be parsed on line 56.
  • Lines 57 and 58 read out the parsed command line arguments, by calling the two functions on lines 17 and 29.
  • On line 59 the actual logic of the tool “call_pokeapi(path, id_number)” is called.
  • call_pokeapi(path, id_number) builds the URL and utilizes the requests module to do the REST call. If the default key “name” exists in the REST call, the value of the json response is returned. If the default key “name” does not exist, the assumption here is that we are out of range of existing Pokemons and therefore receive an error message response. This response has only one key: “detail”. In this case we print out the value of “detail” (which is expected to be “Not found.” 🙂 )
'''
Usage:
    pokepy (pokemon | type | ability) --id=ID

Options:
    -i --id=ID # specify the id of the pokemon, type or ability
    -h --help # Show this help
'''

import requests

POKEAPI = 'https://pokeapi.co/api/v2/{path}/{id}'

def get_api_path(arguments):
    '''
    Get pokemon or type or ability command from command line
    arguments.
    '''
    paths = ['pokemon', 'type', 'ability']
    for path in paths:
        if arguments[path]:
            break
    return path

def get_id(arguments):
    '''
    Get id from command line arguments.
    '''
    return arguments['--id']

def call_pokeapi(path, id_number, key='name'):
    '''
    Call the RESTful PokeAPI and parse the response. If pokemon, ability or
    type ids are not found than the error message detail is returned.
    '''
    url = POKEAPI.format(path=path, id=id_number)
    response = requests.get(url)
    response_json = response.json()
    try:
        res = response_json[key]
    except:
        res = response_json['detail']
    return res

def __main__():
    '''
    Entrypoint of command line interface.
    '''
    from docopt import docopt
    arguments = docopt(__doc__, version='0.1.0')
    path = get_api_path(arguments)
    id_number = get_id(arguments)
    print(call_pokeapi(path, id_number))

setup.py

Now we need to tell Python, that we want to use our module as a command line tool, after installing it. Have a look at the code below:

  • Lines 1 to 9 are basically boiler plate and don’t do much.
  • Then the setup method is called with a lot of partly self explanatory and partly boring parameters. What we really need here are the following two parameters:
    • install_requires where we specify a list of dependencies that will be installed by pip, if the requirements are not already satisfied.
    • entry_points where we specify an entry point “console_scripts” in a dictionary. The value pokepy=pokepy:__main__ means, that when we call “pokepy” from the command line, the __main__ method of the pokepy module will be called.
'''
pokepy setup module
'''

from setuptools import setup, find_packages
from codecs import open
from os import path

here = path.abspath(path.dirname(__file__))

setup(
    name='pokepy',
    version='0.1.0',
    description='A Pokeapi wrapper command line tool',
    long_description=long_description,
    url='https://github.com/condla/pokepy',
    author='Stefan Kupstaitis-Dunkler',
    author_email='email.address@gmail.com',
    license='Apache 2.0',
    classifiers=[
        'Development Status :: 3 - Alpha'
    ],

    keywords='Pokeapi REST client wrapper command line interface',
    packages=find_packages(),
    install_requires=['docopt', 'requests'],
    extras_require={},
    package_data={},
    package_data={},

    entry_points={
        'console_scripts': [
            'pokepy=pokepy:__main__',
        ],
    },
)

Installation

The only thing that’s left is to install our tool and put it to use. I would recommend you to do it in an own virtual environment, but it is not mandatory. In the project directory do:

# this will create a new virtual python environment in the env directory
virtualenv env
# this will activate the environment (now you can install anything into this environment without affecting the rest of the environment)
source env/bin/activate
# install the pokepy module into your virtual environment
pip install -e .

Congratulations you can now go ahead and use your command line tool for example like this (the $ symbol represents the command prompt):

$ pokepy pokemon -i 25
pikachu

Conclusion

We saw why it is useful to wrap an API into a command line interface and how it is done in Python. Now you know everything to go ahead and create more useful tools with a more complex logic by just extending this module fit to your needs.

My Impressions of the Hadoop Summit Dublin 2016

The Hadoop Summit is a tech-conference hosted by Hortonworks, being one of the biggest Apache Hadoop distributors, and Yahoo, being the company in which Hadoop was born. Software developers, consultants, business owners, administrators, that have a mutual interest in Hadoop and the technologies of its ecosystem, all gathered in Dublin – this year’s Hadoop Summit of Europe took place in Ireland. The Hadoop Summit 2016 Dublin had some great  keynotes, plenty of time to network and a lot of exciting talks about bleeding edge technology, its use cases and success stories. Also it was a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.

20160414_103139
Keynote: “Data is Beautiful”

The organisation of the conference was great. 1300 people participated, but it never felt crowded, nor were there any (big) waiting lines to enter the speaker rooms or at the lunch buffet.

My Favorite Talks

This is a list of my favorite talks in a chronological order with their videos embedded. To be honest, this list is basically almost all of the talks that I saw in person and probably I missed even more great talks, that were given in parallel. Fortunately, we can see all of them on the official Hadoop Summit 2016 Dublin Youtube channel.

  • SQL streaming: This talk gave a really nice overview of the development of an SQL streaming solution with all its technical challenges and how they were addressed. Also simple technical use cases were discussed and compared to traditional SQL, where each query terminates, whereas streaming SQL queries never terminate.

  • Hadoop at LinkedIn: Here we got valuable insights into the Hadoop landscape of LinkedIn, as well as job monitoring and automated health checks. A job monitoring tool, Dr. Elephant, developed by LinkedIn was open sourced only a few days before the start of the Summit.

  • IMG-20160423-WA0000Containerization at Spotify:  This talk was about how Spotify uses docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed, that Spotify overcomes security challenges by not implementing internal security measurements at all. According to the speaker everyone can access everyones data. If life could always be as simple as that 🙂

  • Apache Zeppelin + Apache Livy: Apache Zeppelin already is a great tool for interactive data analysis, exploration or even doing ETL tasks using Apache Pig, querying data using Apache Hive, as well as executing Python, R or bash scripts. Apache Livy helps data scientists work together in one notebook on a secure cluster. What I like a lot about this talk is, that the speakers nicely explain the authentication mechanism involved.

  • Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase and much more. This talk was basically a view on the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!

10 Years of Hadoop Party

In the night of day one, the Guinness storehouse was utilized as a huge burger-beer-and-big-data networking event. As you can imagine there was good food, Guinness, great music by Irish bands on several floors and of course most importantly the same cool people attending the conference.

IMG_20160413_200647.jpg
Author in the Guinness storehouse

Summary

My first Hadoop Summit attendance was a great experience in all its particulars. I got great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully, I will be able to attend the next Hadoop Summit 2017 in Munich.