Basics of Hadoop User Management

Hadoop is old, everyone has their own Hadoop cluster and everyone knows how to use it. It’s 2018, right? This article is just a collection of a few gotchas, dos and don’ts with respect to User Management that shouldn’t happen in 2018 anymore.

Terminology

Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip that section if you are an advanced user.

  • OS user = user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check if the user exists on OS level by doing
id username
  • KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC or an integrated one behind a Microsoft Active Directory.
  • Keytab = file that stores the encrypted password of a user provisioned in a KDC. Can be used to authenticate without the need of typing the password using the “kinit” command line tool.

Do

  • Make sure your users are available on all nodes of the OS, as well as in the KDC. This is important for several reasons:
    • When you run a job, the job might create staging/temporary directories in the /tmp/ directory, which are owned by the user running the job. The name of the directory is the name of the OS user, while the ownership belongs to authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC.
    • Keytabs on OS level should be only readable by the user OS user who is supposed to authenticate with them for security purposes.
    • When impersonation is turned on for services, e.g., Oozie using the -doas tag, Hive using the property hive.server2.enable.doAs=True property or Storm using the supervisor.run.worker.as.user=true  property, a user authenticated as a principal will run on the OS level as a processs owned by that user. If that user is not known to the OS, the job will fail (to start).

Don’t

  • Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default and don’t change that configuration). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
  • Don’t run Hive jobs as the “hive” user. The “hive” user is the administrative user and if at all should only be used by the Hadoop/database administrator.
  • Or in general: Don’t use the <service name> user to do  <operation> on <service name>. You saw that coming, hm?

How to Achieve Synchronisation of KDC and OS Level

(…or other user/group management systems). This is a tricky one, if you don’t want to run into a split brain situation, where one system knows one set of users and another one knows others, which may or may not overlap.

  • Automate user provisioning, e.g., by using an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster.
  • Use services such as SSSD (System Security Services Daemon) that integrates users and groups from user and group management services into the operating system. So you won’t need to actually add them to each node, as long as SSSD is up and running.
  • Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )

Maybe I’ll expand that list in the future 🙂

4 Things Factorio Taught Me about DevOps

What is Factorio?

Factorio is a computer game. You probably ask yourself, in which ways a computer game is related to this blog? Well, not at all – or is it? Let’s find out.

maxresdefault

Basically, in the game you take over the role of a character in 3rd person perspective, whose rocket ship crashed on a foreign planet. You don’t have anything, but a pick axe in the beginning. Your goal is to repair your rocket ship to go home safely. In the course of the game you’ll go from picking resources manually and crafting items manually to building mines, creating items in factories and connecting them using conveyor belts. Later on you even built your own power plants, electric networks and trains. The game is not forcing you to automate everything, but if you didn’t you probably wouldn’t reach your goal in this lifetime. In the game’s “About” section you can find how this goal is supposed to be achieved:

You will be mining resources, researching technologies, building infrastructure, automating production and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working and finally protect it from the creatures who don’t really like you.

Does that sound familiar to you? No? – Then read it again and think about it as if it were a very honest job description for an engineering role in a large corporation which just started to “explore” Agile and DevOps methodologies. This is when I started to think about analogies between the game and IT projects.

What the game taught me about IT automation

A lot of those points feel super logical just writing them down, and not even worth mentioning, but then again, why don’t companies start acting on them?

  • You can only succeed if you keep automating processes, automate those processes that save you the most time at the moment and keep automating what you automated and combine automated processes.
  • You can’t completely avoid doing and fixing things manually, nor should you. There’s always the aliens living on the planet* destroying your infrastructure and pipelines. In the beginning before you build repair robots, you need to do a lot of manual repair work. Also, keep in mind to not automate everything. Keep doing things manually that don’t actually justify the time you spend automating processes at that certain point of time. You may want to automate those processes later on, but focus on the important (= most repetitive and time-consuming) ones first.
    *Think about those aliens as an analogy to either software bugs or users or to colleagues who don’t want to move away from waterfall methodology. 😛
  • If you are not the only one working on a project you need to talk to each other and continuously update each other on your (changing) plans. You need to define interfaces and locations where your infrastructure and processes can interact with the infrastructure and processes of your co-workers.
  • You can only succeed if you tear down a working implementation to build something more stable and scalable, even if it means that you will invest lots of resources and time. That doesn’t mean that you shouldn’t have built that piece of infrastructure at all. And it also doesn’t mean that the new piece of infrastructure will be there forever.

I’m not sure myself anymore if I am talking about the game or real-life IT. Thus, it seems to be a working analogy. If you know both the game and real-life DevOps methodologies go ahead and post a comment if you can think about something that is missing in my list above.

This analogy is interesting enough to start checking out more analogies of the game. I’ll create one or two future blog posts on those analogies, such as multi-threading and databases.

 

 

How to Troubleshoot Apache Knox

Apache Knox is a gateway application and the door to access data in a data lake hidden behind a firewall. While the usage is fairly simple the setup, configuration and debugging process can be tedious due to many different components that Apache Knox ties together. I published this article also on Hortonworks Community Connection.

15403-17-05-08-security-concepts-knox
Common Security Architecture Using Apache Knox for Data Access

Start Small

  • First try to access the service directly before you go over Knox. In many cases, there’s nothing wrong with your Knox setup, but with either the way you setup and configured the service behind Knox or the way you try to access that service.
  • When you are familiar on how to access your service directly and when you have verified that it works as intended, try to do the same call on Knox.

Example:

  • You want to check if webhdfs is reachable so you first verify directly at the service and try to get the home directory of the service.
curl --negotiate -u : http://webhdfs-host.field.hortonworks.com:50070/webhdfs/v1/?op=GETHOMEDIRECTORY
  • If above request gives a valid 200 response and a meaningful answer you can safely check your Knox setup.
curl -k -u myUsername:myPassword https://knox-host.field.hortonworks.com:8443/gateway/default/webhdfs/v1/?op=GETHOMEDIRECTORY
  • Note: Direct access of WebHDFS and access of WebHDFS over Knox use two different authentication mechanisms: The first one uses SPNEGO which requires a valid Kerberos TGT in a secure cluster, if you don’t want to receive a “401 – Unauthorized” response. The latter one uses HTTP basic authentication against an LDAP, which is why you need to provide username and password on the command line.
  • Note 2: For the sake of completeness, I mention that here: Obviously, you direct the first request directly to the service host and port, while you direct your second request to the Knox host and port and specify which service.

The next section answers the question, what to do if the second command fails? (If the first command fails, go setup your service correctly and return later).

Security Related Issues

So what do the HTTP response codes mean for a Knox application? Where to start?

Authentication Issues

  • Very common are “401 – Unauthorized”. This can be misleading, since 401 is always tied to authentication – not authorization. That means you need to probably check one of the following items. Which of these items causes the error can be found in the knox log (per default /var/log/knox/gateway.log)
  • Is your username password combination correct (LDAP)?
  • Is your username password combination in the LDAP you used?
  • Is your LDAP server running?
  • Is your LDAP configuration in the Knox topology correct (hostname, port, binduser, binduser password,…)?
  • Is your LDAP controller accessible through the firewall (ports 389 or 636 open from the Knox host)?
  • Note: Currently (in HDP 2.6), you can specify an alias for the binduser password. Make sure, that this alias is all lowercase. Otherwise you will get a 401 response as well.

Authorization Issues

  • If you got past the 401s, a popular response code is “403 – Unauthorized”. Now this has actually really something to do with authorization. Depending on if you use ACL authorization or Ranger Authorization (which is recommended) you go ahead differently. If you use ACLs, make sure that the user/group is authorized in your topology definition. If you use Ranger, check the Ranger audit log dashboard and you will immediately notice two possible error sources:
    Your user/group is not allowed to use Knox.
    Your user/group is not allowed to use the service that you want to access behind Knox.
    Well, we came a long way and with respect to security we are almost done. One possible problem you could become is with impersonation. You need knox to be allowed to impersonate any user who access a service with knox. This is a configuration in core-site.xml: hadoop.proxyuser.knox.groups and hadoop.proxyuser.knox.hosts. Enter a comma separated list of groups and hosts that should be able to access a service over knox or set a wildcard *.
    This is what you get in the Knox log, when your Ranger Admin server is not running and policies cannot be refreshed.
2017-07-05 21:11:53,700 ERROR util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(288)) - PolicyRefresher(serviceName=condlahdp_knox): failed to refresh policies. Will continue to use last known version of policies (3) javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)

This is also a nice example of Ranger’s design to not interfere with services if it’s down: policies will not be refreshed, but are still able operate as intended with the set of policies before Ranger crashed.

Application Specific Issues

Once you are past the authentication and authorization issues, there might be issues with how Knox interacts with its applications. This section might grow with time. If you have more examples of application specific issues, leave a comment or send me an email.

Hive

  • To enable Hive working with Knox, you need to change the transport mode from binary to http. It might be necessary in rare cases to not only restart Hiveserver2 after this configuration change, but also the Knox gateway.
  • This is what you get when you don’t switch the transport mode from “binary” to “http”. Binary runs on port 10000, http runs on port 10001. When binary transport mode is still active Knox will try to connect to port 10001 which is not available and thus fails with “Connection refused”.
2017-07-05 08:24:31,508 WARN hadoop.gateway (DefaultDispatch.java:executeOutboundRequest(146)) - Connection exception dispatching request: http://condla0.field.hortonworks.com:10001/cliservice?doAs=user org.apache.http.conn.HttpHostConnectException: Connect to condla0.field.hortonworks.com:10001 [condla0.field.hortonworks.com/172.26.201.30] failed: Connection refused (Connection refused) org.apache.http.conn.HttpHostConnectException: Connect to condla0.field.hortonworks.com:10001 [condla0.field.hortonworks.com/172.26.201.30] failed: Connection refused (Connection refused) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
  • When you fixed all possible HTTP 401 errors for other services than Hive, but still get on in Hive, you might forget to pass username and password to beeline
beeline -u "<jdbc-connection-string>" -n "<username>" -p "<password>"
  • The correct jdbc-connection-string should have a format as in the example below:
jdbc:hive2://$KNOX_HOSTNAME:$KNOX_PORT/default;ssl=true;sslTrustStore=$TRUSTSTORE_PATH;trustStorePassword=$TRUSTSTORE_SECRET;transportMode=http;httpPath=gateway/default/hive
    • $TRUSTSTORE_PATH is the path to the truststore containing the knox server certificate, on the server with root access you could e.g. use /usr/hdp/current/knox-server/data/security/keystores/gateway.jks
    • $KNOX_HOSTNAME is the hostname where the Knox instance is running
    • $KNOX_PORT is the port exposed by Knox
    • $TRUSTSTORE_SECRET is the secret you are using for your truststore
  • Now, this is what you get, when you connect via beeline trying to talk to Knox from a different (e.g. internal) hostname than the one configured in the ssl certificate of the server. Just change the hostname and everything will work fine. While this error is not specifically Hive related, you will most of the time encounter it in combination with Hive, since most of the other services don’t require you to check your certificates. 
Connecting to jdbc:hive2://knoxserver-internal.field.hortonworks.com:8443/;ssl=true;sslTrustStore=truststore.jks;trustStorePassword=myPassword;transportMode=http;httpPath=gateway/default/hive 17/07/06 12:13:37 [main]: ERROR jdbc.HiveConnection: Error opening session org.apache.thrift.transport.TTransportException: javax.net.ssl.SSLPeerUnverifiedException: Host name 'knoxserver-internal.field.hortonworks.com' does not match the certificate subject provided by the peer (CN=knoxserver.field.hortonworks.com, OU=Test, O=Hadoop, L=Test, ST=Test, C=US)

HBase

  • WEBHBASE is the service in a Knox topology to access HBase via the HBase REST server. Of course, a prerequisite is that the HBase REST server is up and running.
  • Even if it is up and running it can occur that you receive an Error with HTTP code 503. 503: Unavailable. This is not related to Knox. You can track down the issue to a HBase REST server related issue, in which the authenticated user does not have privileges to e.g. scan the data. Give the user the correct permissions to solve this error.

Hadoop Security Concepts

While security is a quite complex topic by itself, security of distributed systems can be overwhelming. Thus, I wrote down a state of the art article about Hadoop (Ecosystem) Security Concepts and also published it on Hortonworks Community Connection.

In the documentation of the particular security related open source projects you can find a number of details on how these components work on their own and which services they rely on. Since the projects are open source you can of course check out the source code for more information. Therefore, this article aims to summarise, rather than explain each process in detail.

In this article I am first going through some basic component descriptions to get an idea which services are in use. Then I explain the “security flow” from a user perspective (authentication –> impersonation (optional) –> authorization –> audit) and provide a short example using Knox.

When reading the article keep following figure in mind. It depicts all the process that I’ll explain.

Component Descriptions and Concepts

Apache Ranger

Components and what they do:

  • Ranger Admin Service:
    • Provides RESTful API and a UI to manage authorization policies and service access audits based on resources, users, groups and tags.
  • Ranger User sync:
    • Syncs users and groups from an LDAP source (OpenLDAP or AD)
    • Stores users and groups in the relational DB of the Ranger service.
  • Ranger Plugins:
    • Service side plugin, that syncs policies from Ranger per default every 30 seconds. That way authorization is possible even if Ranger Admin does not run in HA mode and is currently down.
  • Ranger Tag Sync:
    • Syncs tags from Atlas meta data server
    • Stores tags in the relational DB of the Ranger service.
  • Ranger Key Management Service (KMS):
    • Provides a RESTful API to manage encryption keys used for encrypting data at rest in HDFS.
  • Supporting relational Database:
    • Contains all policies, synced users, groups, tags
  • Supporting Apache Solr instances:
    • Audits are stored here.

Documentation:

  • For the newest HDP release (2.6.0) use these Ranger Docs

Apache Atlas

Components:

  • Meta Data Server
    • Provides a RESTful API and a UI to manage meta data objects
  • Metastore
    • Contains meta data objects
  • Index
    • Maintains index to meta data objects

Documentation:

  • For the newest HDP release (2.6.0) use these Atlas Docs

Apache Knox

  • Knox serves as a gateway and proxy for Hadoop services and their UIs so that they can be accessible behind a firewall without requiring to open too many ports in the firewall.

Documentation:

  • For the newest HDP release (2.6.0) use these Knox Docs

Active Directory

Components:

  • Authentication Server (AS)
    • Responsible for issuing Ticket Granting Tickets (TGT)
  • Ticket Granting Server (TGS)
    • Responsible for issuing service tickets
  • Key Distribution Center (KDC)
    • Talks with clients using KRB5 protocol
    • AS + TGS
  • LDAP Server
    • Contains user and group information and talks with its clients using the LDAP protocol.
  • Supporting Database

Wire Encryption Concepts

To complete the picture I just want to mention that it is very important, to not only secure the access of services, but also encrypt data transferred between services.

Keystores and Truststores

To enable a secure connection (SSL) between a server and a client, first an encryption key needs to be created. The server uses it to encrypt any communication. The key is securely stored in a keystore for Java services JKS could be used. In order for a client to trust the server, one could export the key from the keystore and import it into a truststore, which is basically a keystore, containing keys of trusted services. In order to enable two-way SSL the same thing needs to be done on the client side. After creating a key in a keystore the client can access, put it into a trust store of the server. Commands to perform these actions are:

  • Generate key in "/path/to/keystore.jks" setting its alias to "myKeyAlias" and its password to "myKeyPassword". If the keystore file "/path/to/keystore.jks" does not exist, this will command will also create it.
keytool -genkey -keyalg RSA -alias myKeyAlias -keystore /path/to/keystore.jks -storepass myKeyPassword -validity 360 -keysize 2048
  • Export key stored in “/path/to/keystore.jks” with alias “myKeyAlias” into a file “myKeyFile.cer”
keytool -export -keystore /path/to/keystore.jks -alias myKeyAlias -file myKeyFile.cer
  • Import key from a file “myKeyFile.cer” with alias “myKeyAlias” into a keystore (that may act as truststore) named “/path/to/truststore.jks” using the password “trustStorePassword”
keytool -import -file myKeyFile.cer -alias myKeyAlias -keystore /path/to/truststore.jks -storepass trustStorePassword

Security Flow

Authentication

Only a properly authenticated user (which can also be a service using another service) can communicate successfully with a kerberized Hadoop service. Missing the required authentication, in this case by proving the identity of both user and the service, any communication will fail. In a kerberized environment user authentication is provided via a ticket granting ticket (TGT).

Note: Not using KERBEROS, but SIMPLE authentication, which is set up by default, provides any user with the possibility to act as any other type of user, including the superuser. Therefore strong authentication using Kerberos is highly encouraged.

Technical Authentication Flow:

  1. User requests TGT from AS. This is done automatically upon login or using the kinit command.
  2. User receives TGT from AS.
  3. User sends request to a kerberized service.
  4. User gets service ticket from Ticket Granting Server. This is done automatically in the background when user sends a request to the service.
  5. User sends service a request to the service using the service ticket.

Authentication Flow from a User Perspective:

Most of the above processes are hidden from the user. The only thing, the user needs to do before issuing a request from the service is to login on a machine and thereby receive a TGT or receive it programmatically or obtain it manually using the kinit command.

Impersonation

This is the second step after a user is successfully authenticated at a service. The user must be authenticated, but can then choose to perform the request to the service as another user. If everyone could do this by default, this would raise another security concern and the authentication process would be futile. Therefore this behaviour is forbidden by default for everyone and must be granted for individual users. It is used by proxy services like Apache AmbariApache Zeppelin or Apache Knox. Ambari, Zeppelin and Knox authenticate as “ambari”, “zeppelin”, “knox” users, respectively, at the service using their TGTs, but can choose to act on behalf of the person, who is logged in in the browser in Ambari, Zeppelin or Knox. This is why it is very important to secure these services.

To allow, for example, Ambari to perform operations as another user, set the following configs in the core-site.xml, hadoop.proxyuser.ambari.groups and hadoop.proxyuser.ambari.hosts, to a list of groups or hosts that are allowed to be impersonated or set a wildcard *.

Authorization

Authorization defines the permissions of individual users. After it is clear which user will be performing the request, i.e., the actually authenticated or the impersonated one, the service checks against the local Apache Ranger policies, if the request is allowed for this certain user. This is the last instance in the process. A user passing this step is eventually allowed to perform the requested action.

Audit

Every time the authorization instance is called, i.e., policies are checked if the action of a user is authorized or not, an audit event is being logged, containing, time, user, service, action, data set and success of the event. An event is not logged in Ranger in case a user without authentication tries to access data or if a user tries to impersonate another user, without having appropriate permissions to do so.

Example Security Flow Using Apache Knox

Looking at the figure above you can follow what’s going on in the background, when a user Eric wants to push a file into the HDFS service on path “/user/eric/” from outside the Hadoop cluster firewall.

  1. User Eric sends the HDFS request including the file and the command to put that file into the desired directory, while authenticating successfully via LDAP provider at the Apache Knox gateway using his username/password combination. Eric does not need to obtain a Kerberos ticket. In fact, since he is outside the cluster, he probably does not have access to the KDC through the firewall to obtain one anyway.
  2. Knox Ranger plugin checks, if Eric is allowed to use Knox. If he’s not, the process ends here. This event is logged in Ranger audits.
  3. Knox has a valid TGT (and refreshes it before it becomes invalid), obtains a service ticket with it and authenticates at the HDFS namenode as user “knox”.
  4. Knox asks the service to perform the action as Eric, which is configured to be allowed.
  5. Ranger HDFS plugin checks, if Eric has the permission to “WRITE” to “/user/eric”. If he’s not, the process ends here. This event is logged in Ranger audits.
  6. File is pushed to HDFS.

I hope this article helps to get a better understanding of the security concepts within the Hadoop Ecosystem. 

How to Write a Marker File in a Luigi “PigJobTask”

This is supposed to be a brief aid to memory on how to write marker files, when using “Luigi“, which I explained in a former blog post.

What is a Marker File?

A marker file is an empty file created with the sole purpose of signalizing to another process or application that some process is currently ongoing or finished. In the context of scheduling using Luigi, a marker file signalizes the Luigi scheduler that a certain task of a pipeline has already been finished and does not need to (re-)run anymore.

How the Common Luigi Job Rerun Logic Works

Every Luigi task has a run method. In this run method you can use any sort of (Python) code you desire. You can access the input and output streams of the Task object and use it to write data to the output stream. The principle is that a Luigi Task will not run again, if the file with the filename defined in the output target already exists. This can be either a LocalTarget (local file) or an HDFSTarget (file saved to HDFS) or any other custom target. That’s basically it.

How to Write a Marker File in a PigJobTask

Using a PigJobTask, the idea is that you run a Pig script of any complexity. You define the input and output files in your pig script. In the Luigi pipeline, you basically define the pig script location that you want to run and optionally a few other parameters depending on your Hadoop cluster configuration, but you don’t need to implement the run method anymore.

The scenario is that you do not have access to the HDFS output directory, e.g. because its the Hive warehouse directory or the Solr index directory,… or you simply can’t determine the output name of the underlying MapReduce job. So you need to “manually” create an empty file locally or in HDFS that signalizes Luigi that the job already has successfully run. You can specify an arbitrary output file in the output method. This will not create a marker file yet. The trick is to implement the run method specify explicitly to execute the pig script and do arbitrary stuff, such as creating a marker file, afterwards in the method.

You can see a sample PigJobTask that utilizes this technique below

class HiveLoader(luigi.contrib.pig.PigJobTask):
    '''
    Pig script executor to load files from HDFS into a Hive table 
    (can be Avro, ORC,....)
    '''

    input_directory = luigi.Parameter()
    hive_table = luigi.Parameter()
    pig_script = luigi.Parameter()
    staging_dir = luigi.Parameter(default='./staging_')

def requires(self):
    return DependentTask() # requirement

def output(self):
    '''
    Here the output file that determines if a task was run is written.
    Can be LocalTarget or HDFSTarget or ...
    '''
    return luigi.LocalTarget(self.staging_dir + "checkpoint")

def pig_options(self):
    '''
    These are the pig options you want to start the pig client with
    '''
    return ['-useHCatalog']

def pig_script_path(self):
    '''
    Execute pig script.
    '''
    return self.pig_script

def pig_parameters(self):
    '''
    Set Pig input parameter strings here.
    '''
    return {
        'INPUT': self.input_directory,
        'HIVE_TABLE': self.hive_table
    }

def run(self):
    '''
    This is the important part. You basically tell the run method to run the Pig
    script. Afterwards you do what you want to do. Basically you want to write an
    empty output file - or in this case you write "SUCCESS" to the file.
    '''
    luigi.contrib.pig.PigJobTask.run(self)
    with self.output().open('w') as f:
    f.write("SUCCESS")

Book Review: Learning Responsive Data Visualization

This post is about describing my experiences reading a book: “Learning Responsive Data Visualization” by Christoph Körner.

What is it all about?

The book aims to explain the concepts and application of responsive data visualization technologies. It describes the famous CSS framework from Twitter “Bootstrap“, SVG graphics and the JavaScript visualization framework D3.js.

The book has 9 chapters: starting from a short introduction of the components in use, it quickly enables the user to create their first visualization and increases the level of detail and complexity systematically. Later, it describes a combined usage of these components and presents techniques on how to create more elaborate layouts and animations.

In the end,  the book motivates and explains how to test visualization applications, as well as outlines how to solve cross-browser issues.

About the Author

From Amazon

Christoph Körner, CTO and lead developer at GESIM, a start-up company, is a passionate software engineer, web enthusiast, and an active member of the JavaScript community with more than 5 years of experience in developing customer-oriented web applications. He is the author of Data Visualizations with D3 and AngularJS and is currently pursuing his master’s degree in Visual Computing at Vienna Institute of Technology.

My opinion

Christoph uses short, concise descriptions. Instead of being verbose, he yields many links for further reading to official documentation or interesting blog entries by utilizing non-invasive text boxes throughout the chapters. The author understands how to direct the reader’s attention at the important parts of the technologies introduced.

Nevertheless, here and there, the author finishes a section with an outlook to an advanced topic that sometimes could have needed a little closer attention. An example of this can be found at the end of chapter 2, when the author mentions, that “D3 provided more useful methods on the generator functions”. He then names only one such method and describes it in one sentence. More useful would have been a small list of these methods or to provide yet another of the excellent code examples in the book.

What I really enjoyed is that the author follows the title of the book closely and visualizes not only the code examples but also graphically depicts the concepts and philosophy of the frameworks in use. This helped me a lot to understand the ideas.

One of the most important things of a textbook is to be simple and comprehensible. Christoph easily reaches these goals.

responsive_data_vis

Audience

In my opionon, you need some level of experience with HTML, CSS and JavaScript before you can get started.
Thus, I believe the book aims at developers of intermediate level. On the other hand, if you bring these prerequisites this book is aimed at beginners of D3.js.

Conclusion

At Amazon I rated the book with 4 stars: While I mentioned above, that I like that it is completely fact based and content focused, I kind of miss to get some historical information or funny side stories in footnotes or fact boxes. Instead, fact boxes are used efficiently to point to additional technical content. There are some rare 5-star-books out there that achieve to create this fine bridge of being educational and entertaining. “Learning Responsive Data Visualization” does not build this bridge, but delivers a solid book to teach yourself and others modern responsive data visualization.

How to Create a Data Pipeline Using Luigi

This is a simple walk-through of an example usage of Luigi. Online there is the excellent documentation of Spotify themselves. You can find all bits and bytes out there to create your own pipeline script. Also, there are already a few blog posts about what is possible when using Luigi, but then – I believe – it’s not very well described how to implement it. So, in my opinion there is either too much information to just try it out or too few information to actually get started hands-on. Also, I’ll mention a word about security.

Therefore, I publish a full working example of a minimalist pipeline from where you can start, copy and paste everything you need

These are the question I try to answer:

  • What is Luigi and when do I want to use it?
  • How do I setup the Luigi scheduler?
  • How do I specify a Luigi pipeline?
  • How do I schedule a Luigi pipeline?
  • Can I use Luigi with a secure Hadoop cluster?
  • What I like about Luigi?

What is Luigi?

Luigi is a framework written in Python that makes it easy to define and execute complex pipelines in a consistent way. You can use Luigi …

  • … when your data is processed in (micro) batches, rather than it is streamed
  • … when you want to run jobs that depend on (many) other jobs.
  • … when you want to have nice visualizations of your pipelines to keep a good overview.
  • … when you want to integrate data into the Hadoop ecosystem.
  • … when you want to do any of the above and love Python.

Create Infrastructure

Every pipeline can actually be tested using the --local-scheduler tag in the command line. But for production you should use a central scheduler running on one node.

The first thing you want to do is to create a user and a group the scheduler is running as.

groupadd luigi
useradd -g luigi luigi

The second step is to create a Luigi config directory.

sudo mkdir /etc/luigi
sudo chown luigi:luigi /etc/luigi

You also need to install Luigi (and Python and pip) if you did not do that already.

pip install luigi

It’s now time to deploy the configuration file. Put the following file into /etc/luigi/luigi.cfg. In this example the Apache Pig home directory of a Hortonworks Hadoop cluster is specified. There are many more configuration options listed in the official documentation.

[core]
default-scheduler-host=www.example.com
default-scheduler-port=8088

[pig]
home=/usr/hdp/current/pig-client

Don’t forget to create directories for the process id of the luigi scheduler daemon, the store log and libs.

sudo mkdir /var/run/luigi
sudo mkdir /var/log/luigi
sudo mkdir /var/lib/luigi
chown luigi:luigi /var/run/luigi
chown luigi:luigi /var/log/luigi
chown luigi:luigi /var/lib/luigi

You are now prepared to start up the scheduler daemon.

sudo su - luigi
luigid --background --port 8088 --address www.example.com --pidfile /var/run/luigi/luigi.pid --logdir /var/log/luigi --state-path /var/lib/luigi/luigi.state'

A Simple Pipeline

We are now ready to go. Let’s specify an example pipeline that actually can be run without a Hadoop ecosystem present: It reads data from a custom file, counting the number of words and writing the output to a file called count.txt. In this example two of the most basic task types are used: luigi.ExternalTask which requires you to implement the output method and luigi.Task which requires you to implement the requires, output and run methods. I added pydocs to all methods and class definitions, so the code below should speak for itself. You can also view it on Github.

import luigi

class FileInput(luigi.ExternalTask):
'''
Define the input file for our job:
The output method of this class defines
the input file of the class in which FileInput is
referenced in &quot;requires&quot;
'''

# Parameter definition: input file path
input_path = luigi.Parameter()

def output(self):
'''
As stated: the output method defines a path.
If the FileInput  class is referenced in a
&quot;requires&quot; method of another task class, the
file can be used with the &quot;input&quot; method in that
class.
'''
return luigi.LocalTarget(self.input_path)

class CountIt(luigi.Task):
'''
Counts the words from the input file and saves the
output into another file.
'''

input_path = luigi.Parameter()

def requires(self):
'''
Requires the output of the previously defined class.
Can be used as input in this class.
'''
return FileInput(self.input_path)

def output(self):
'''
count.txt is the output file of the job. In a more
close-to-reality job you would specify a parameter for
this instead of hardcoding it.
'''
return luigi.LocalTarget('count.txt')

def run(self):
'''
This method opens the input file stream, counts the
words, opens the output file stream and writes the number.
'''
word_count = 0
with self.input().open('r') as ifp:
for line in ifp:
word_count += len(line.split(' '))
with self.output().open('w') as ofp:
ofp.write(unicode(word_count))

if __name__ == &quot;__main__&quot;:
luigi.run(main_task_cls=CountIt)

Schedule the Pipeline

To test and schedule your pipeline create a file test.txt with arbitrary content.
We can now execute the pipeline manually by typing

python pipe.py --input-path test.txt

Use the following if you didn’t set up and configure the central scheduler as described above

python pipe.py --input-path test.txt -local-scheduler

If you did everything right you will see that no tasks failed and a file count.txt was created that contains the count of the words of your input file.

Try running this job again. You will notice that Luigi will tell you that there already is a dependency present. Luigi detects that the count.txt is already written and will not run the job again.

Now you can easily trigger this pipeline on a daily base by using, e.g., crontab in order to schedule the job to run, e.g., every minute. If your input and output file has the current date in the filename’s suffix, the job will be triggered every minute, but successfully run only exactly once a day.

In a crontab you could do the following:

1 * * * * python pipe.py --input-path test.txt

Security

The cool thing about Luigi is, that you basically don’t need to worry much about security. Luigi basically uses the security features of the components it interacts with. If you are, e.g., working on a secure Hadoop cluster (that means on a cluster, where Kerberos authentication is enforced) the only thing you need to worry about, is that you obtain a fresh Kerberos ticket before you trigger the job – given that the validity of the ticket is longer than the job needs to finish. I.e., when you schedule your pipeline with cron make sure you do a kinit from a keytab. you can check out my answer to a related question on the Hortonworks community connection for more details on that (https://community.hortonworks.com/questions/5488/what-are-the-required-steps-we-need-to-follow-in-s.html#answer-5490) .

What do I like about Luigi?

It combines my favourite programming language and my favourite distributed ecosystem. I didn’t go too much into that now. But Luigi is especially great because of its rich ways to interact with Hadoop Ecosystem services. Instead of a LocalTarget you would rather use HdfsTargets or Amazon S3Targets. You can define and run Pig jobs and there even is a Apache Hive client built in.