How to Write a Marker File in a Luigi “PigJobTask”

This is a brief memory aid on how to write marker files when using Luigi, which I introduced in an earlier blog post.

What is a Marker File?

A marker file is an empty file created for the sole purpose of signaling to another process or application that some process is currently running or has finished. In the context of scheduling with Luigi, a marker file signals to the Luigi scheduler that a certain task of a pipeline has already finished and does not need to run again.
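As a minimal sketch of the idea outside of Luigi, the following uses only the Python standard library. The names `write_marker` and `run_once` and the marker name `_DONE` are made up for this illustration and are not part of any library.

```python
import os
import tempfile


def write_marker(path):
    """Create an empty marker file at path, including parent directories."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Opening in 'w' mode and closing immediately leaves a zero-byte file.
    with open(path, "w"):
        pass


def run_once(marker_path, job):
    """Run job() unless the marker exists; write the marker on success."""
    if os.path.exists(marker_path):
        return False  # marker present: the work was already done, so skip it
    job()  # if this raises, the marker below is never written
    write_marker(marker_path)
    return True


if __name__ == "__main__":
    marker = os.path.join(tempfile.mkdtemp(), "_DONE")
    print(run_once(marker, lambda: None))  # first call runs the job
    print(run_once(marker, lambda: None))  # second call is skipped
```

Writing the marker only after the job returns is the important design choice: a failed job leaves no marker behind, so it will be retried on the next attempt.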

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any (Python) code you like. You can access the input and output targets of the Task object and use them to write data to the output. The principle is that a Luigi Task will not run again if the file named in its output target already exists. The target can be a LocalTarget (local file), an HdfsTarget (file stored in HDFS), or any other custom target. That's basically it.
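The rerun rule can be modelled in a few lines of plain Python. This is a simplified model and not Luigi's actual code: `FileTarget`, `SketchTask` and `build` are hypothetical stand-ins that only mimic the "skip if the output exists" check.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for a file target such as a LocalTarget."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class SketchTask:
    """Hypothetical stand-in for a task with output() and run() methods."""

    def __init__(self, path):
        self.path = path
        self.run_count = 0

    def output(self):
        return FileTarget(self.path)

    def complete(self):
        # The task counts as done as soon as its output target exists.
        return self.output().exists()

    def run(self):
        self.run_count += 1
        with open(self.output().path, "w") as f:
            f.write("data\n")


def build(task):
    """Run the task only if it is not complete; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


if __name__ == "__main__":
    task = SketchTask(os.path.join(tempfile.mkdtemp(), "out.txt"))
    print(build(task))  # the output is missing, so the task runs
    print(build(task))  # the output exists now, so the task is skipped
```

Deleting the output file makes the task eligible to run again, which is also the usual way to force a rerun.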

How to Write a Marker File in a PigJobTask

With a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your Pig script. In the Luigi pipeline, you only define the location of the Pig script that you want to run and, optionally, a few other parameters depending on your Hadoop cluster configuration. You don't need to implement the run method anymore.

The scenario is that you do not have access to the HDFS output directory, e.g. because it's the Hive warehouse directory or the Solr index directory, or you simply can't determine the output name of the underlying MapReduce job. So you need to "manually" create an empty file, locally or in HDFS, that signals to Luigi that the job has already run successfully. You can specify an arbitrary output file in the output method, but this alone does not create a marker file. The trick is to implement the run method yourself: explicitly execute the Pig script in it, and afterwards do whatever else you need in the same method, such as creating the marker file.

You can see a sample PigJobTask that uses this technique below:

import luigi
import luigi.contrib.pig


class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table (can be Avro, ORC, ...)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

    def requires(self):
        return DependentTask()  # placeholder for the upstream task, defined elsewhere

    def output(self):
        '''
        The marker file whose existence tells Luigi that this task has already run.
        Can be LocalTarget or HdfsTarget or ...
        '''
        return luigi.LocalTarget(self.staging_dir + "checkpoint")

    def pig_options(self):
        '''
        These are the pig options you want to start the pig client with
        '''
        return ['-useHCatalog']

    def pig_script_path(self):
        '''
        Path of the Pig script to execute.
        '''
        return self.pig_script

    def pig_parameters(self):
        '''
        Set Pig input parameter strings here.
        '''
        return {'INPUT': self.input_directory,
                'HIVE_TABLE': self.hive_table}

    def run(self):
        '''
        This is the important part. You basically tell the run method to run the Pig
        script. Afterwards you do what you want to do. Basically you want to write an
        empty output file - or in this case you write "SUCCESS" to the file.
        '''
        # Run the Pig script first, then write the marker file.
        luigi.contrib.pig.PigJobTask.run(self)
        with self.output().open('w') as f:
            f.write("SUCCESS")
