In order to continue the series “Big Data and Stream Processing 101”, I created a little video to show the most basic operation of Apache Ranger: Apache Ranger can do much more, but the most basic operation is creating a single policy for a single resource and a single user.
It’s not best practice to give single users access to certain resources. For productive setups you’d look more into giving certain roles access to certain resources or tags. But that’s not being discussed here.
While the number of tools in the Open Source Big Data and Streaming Ecosystem still grows, frameworks that are around for a long time become highly mature and feature rich, some may say “enterprise ready”. Thus, it’s not surprising to me to see a lot of my customers who are new to the whole ecosystem are struggling understanding the basics of each of these tools. The first question always is, “When do I use which tool?”, but this is often not enough without having seen a certain tool in action.
This and a tweet that I recently stumbled upon, were motivation enough for me to explain the most basic things you can do with these tools. Each future blog post will contain a description of the most basic operation of exactly one tool and a detailed explanation of this and only this basic operation and none of the advance features, that might confuse beginners.
In this first blog post of the series, I want to categorise the tools – as good as possible – and describe each of them with as few words as possible, ideally less than a full sentence. Then I’ll introduce each of these tools in arbitrary order in subsequent blog posts. This should help anybody get started and then attend trainings or do self study to get to know all the features that are supposed to make our lives easier processing and managing data, and lower the barrier to get started with each of these.
Disclaimer: this overview is highly opinionated, most probably biased by my own experience and definitely incomplete ( = not exhaustive). I’m definitely up for discussion and open for questions on why I put a certain tool into a certain category and why certain categories are named as they are. So don’t hesitate to reach out to me 🙂
Tools and Frameworks
I’m re-using the slides that I recently created for a “lunch and learn” session. You’ll notice that a lot if not all of the tools appear in multiple categories.
( ) parentheses mean that I had some issues and spent some time considering if I really would put this tool in a certain category, because it strictly doesn’t fit. I probably put it there, because it *can* be used or it is often *used in combination* with tools in this category.
[ ] parentheses mean that the tool is not very popular anymore. It might still be supported, highly used and mature, but is just not popular anymore and likely to fade away and being replace by another tool.
All – Categorized by Function
“All” doesn’t mean every past and future existing tool in the ecosystem. All in this article means just all the tools that I consider and that are available in one of the distributions of HDP, CDH or the new Cloudera Data Platform (CDP)
Note: “Technical Frameworks” are not frameworks you’d work with on a daily base or at all. They’re just there and enable the rest of the cluster to work properly or enable certain features. All of the frameworks/tools/projects in this category are very different from each other.
Processing – Categorised by Speed of Data
Here “Data at Rest” means, that data could possibly be old, historic data, while “Streaming Data” considers event based/stream processing – processing of data while it’s on it’s why from creation at the source to the final destination. The final destination could be a “Data at Rest” persistence engine/database.
Databases – Categorised by Latency
Latency here could refer to two different things:
How up-to-date the data in the database is
How long a query to the database takes to respond with the results
I don’t distinguish those two in this categorisation, which would make this exercise a bit too detailed and tedious. Generally, it’s important to consider both to choose an adequate database for a certain use case.
All – Categorised by Use Case
I chose four typical use cases for this categorisation. A lot of other use cases can be realized
List of All Tools and Frameworks
Again, “All” doesn’t mean all tools currently available in the open source big data ecosystem. “All” means the bulletproof, tested, compatible set of components that easily cover the most common Big Data and Streaming use cases.
Apache NiFi: Manage data flows; get data from A to B and process it on the way with a UI
Apache Spark: Use dataframes to extract, transform and load data, train and evaluate ML models.
Apache Storm is a real-time, fault-tolerant, event-based streaming framework and platform that runs your code in a highly parallelized way on distributed nodes. It’s all about Spouts (processing units to read from data sources) and Bolts (general processing units). Storm is often used to read data from Apache Kafka and write the results back to Kafka or to a data store. Apache Storm and Apache Kafka are the work horses of the cyber security platform Apache Metron. Storm is also being used internally by the Streaming Analytics Manager (SAM)
This article guides you through the debugging process and points you to the places you need to tweak your configuration to get your topology up and running in a kerberized environment in case certain errors occur. For basic information on how to authenticate your application check out the reference implementation by Pierre Villard on his Github page.
I assume that you start from a certain point:
Your Storm cluster and the services you communicate with (Kafka, Zookeeper, HBase) is up and running as well as secure, i.e., the authentication happens through the Kerberos protocol.
Your Storm cluster is configured to run topologies as the OS user corresponding to the Kerberos principal who submitted the topology. (See: “Run worker processes as user who submitted the topology” in the excellent article of the Storm documentation)
Your topology (written in Java) is ready to be deployed and authentication is put in place.
Use the Storm UI to check if the topology’s workers are throwing any errors and on which machine they are running! The worker’s log files are stored on the machine the worker is running in /var/log/storm/workers-artifacts/<topology-name><unique-id>/<port-number>/worker.log.
Check the input data and output data of your Storm topology. In case you are using Kafka, connect via the Kafka console consumer and read from the input and the output topic of your topology! If you don’t see any events in the input Kafka topic, you should check upstream for errors. If you do see input events, but no output events, refer to your topology logs described in the item above. If you do see output events, check if they have the expected format (data format, number and kind of fields are correct, fields contain data that makes sense as opposed to null values)
# List Kafka topics:
bin/kafka-topics.sh --zookeeper <zookeeper.hostname>:<zookeeper.port> --list
# print messages as they are written on stdout from input topic
bin/kafka-console-consumer.sh --bootstrap-server <kafka.broker.hostname>:<kafka.broker.port> --topic input
# print messages as they are written on stdout from output topic
bin/kafka-console-consumer.sh --bootstrap-server <kafka.broker.hostname>:<kafka.broker.port> --topic output
Possible Error Scenarios
Authentication Errors Exception
Caused by: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
Your topology is being submitted and the supervisor tries to start and initialize the Spouts and Bolts in the worker process based on the configuration you provided. When this error occurs the worker process is killed and the supervisor tries to spawn a new worker process. On the machine the worker is supposed to run, you can see a worker process popping up with a certain PID (ps aux | grep <topology_name>). A few seconds later this process is killed and a process exactly as the old one is started with a different PID. You can also tail the worker log and see this error message. Soon afterwards the “Worker has died” message appears. This can happen for various reasons:
The OS user running the topology does not have the permission to read the keytabs configured in the jaas config file. Check with ps auxor top which user is running and check if the keytab has the correct POSIX attributes. Usually it should be read-only by the owning user (-r– — — <topology-user><topology-user>)
The jaas configuration points to the wrong keytabs to be used for authentication and the OS user does not have permission to those. Check with ps aux which jaas file is configured. You might find an option there. Check if this jaas config file has the desired authentication options configured. If not configure your own and pass it to the topology.
Hadoop is old, everyone has their own Hadoop cluster and everyone knows how to use it. It’s 2018, right? This article is just a collection of a few gotchas, dos and don’ts with respect to User Management that shouldn’t happen in 2018 anymore.
Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip that section if you are an advanced user.
OS user = user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check if the user exists on OS level by doing
KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC or an integrated one behind a Microsoft Active Directory.
Keytab = file that stores the encrypted password of a user provisioned in a KDC. Can be used to authenticate without the need of typing the password using the “kinit” command line tool.
Make sure your users are available on all nodes of the OS, as well as in the KDC. This is important for several reasons:
When you run a job, the job might create staging/temporary directories in the /tmp/ directory, which are owned by the user running the job. The name of the directory is the name of the OS user, while the ownership belongs to authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC.
Keytabs on OS level should be only readable by the user OS user who is supposed to authenticate with them for security purposes.
When impersonation is turned on for services, e.g., Oozie using the -doas tag, Hive using the property hive.server2.enable.doAs=True property or Storm using the supervisor.run.worker.as.user=true property, a user authenticated as a principal will run on the OS level as a processs owned by that user. If that user is not known to the OS, the job will fail (to start).
Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default and don’t change that configuration). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
Don’t run Hive jobs as the “hive” user. The “hive” user is the administrative user and if at all should only be used by the Hadoop/database administrator.
Or in general: Don’t use the <service name> user to do <operation> on <service name>. You saw that coming, hm?
How to Achieve Synchronisation of KDC and OS Level
(…or other user/group management systems). This is a tricky one, if you don’t want to run into a split brain situation, where one system knows one set of users and another one knows others, which may or may not overlap.
Automate user provisioning, e.g., by using an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster.
Use services such as SSSD (System Security Services Daemon) that integrates users and groups from user and group management services into the operating system. So you won’t need to actually add them to each node, as long as SSSD is up and running.
Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )
Factorio is a computer game. You probably ask yourself, in which ways a computer game is related to this blog? Well, not at all – or is it? Let’s find out.
Basically, in the game you take over the role of a character in 3rd person perspective, whose rocket ship crashed on a foreign planet. You don’t have anything, but a pick axe in the beginning. Your goal is to repair your rocket ship to go home safely. In the course of the game you’ll go from picking resources manually and crafting items manually to building mines, creating items in factories and connecting them using conveyor belts. Later on you even built your own power plants, electric networks and trains. The game is not forcing you to automate everything, but if you didn’t you probably wouldn’t reach your goal in this lifetime. In the game’s “About” section you can find how this goal is supposed to be achieved:
You will be mining resources, researching technologies, building infrastructure, automating production and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working and finally protect it from the creatures who don’t really like you.
Does that sound familiar to you? No? – Then read it again and think about it as if it were a very honest job description for an engineering role in a large corporation which just started to “explore” Agile and DevOps methodologies. This is when I started to think about analogies between the game and IT projects.
What the game taught me about IT automation
A lot of those points feel super logical just writing them down, and not even worth mentioning, but then again, why don’t companies start acting on them?
You can only succeed if you keep automating processes, automate those processes that save you the most time at the moment and keep automating what you automated and combine automated processes.
You can’t completely avoid doing and fixing things manually, nor should you. There’s always the aliens living on the planet* destroying your infrastructure and pipelines. In the beginning before you build repair robots, you need to do a lot of manual repair work. Also, keep in mind to not automate everything. Keep doing things manually that don’t actually justify the time you spend automating processes at that certain point of time. You may want to automate those processes later on, but focus on the important (= most repetitive and time-consuming) ones first. *Think about those aliens as an analogy to either software bugs or users or to colleagues who don’t want to move away from waterfall methodology. 😛
If you are not the only one working on a project you need to talk to each other and continuously update each other on your (changing) plans. You need to define interfaces and locations where your infrastructure and processes can interact with the infrastructure and processes of your co-workers.
You can only succeed if you tear down a working implementation to build something more stable and scalable, even if it means that you will invest lots of resources and time. That doesn’t mean that you shouldn’t have built that piece of infrastructure at all. And it also doesn’t mean that the new piece of infrastructure will be there forever.
I’m not sure myself anymore if I am talking about the game or real-life IT. Thus, it seems to be a working analogy. If you know both the game and real-life DevOps methodologies go ahead and post a comment if you can think about something that is missing in my list above.
This analogy is interesting enough to start checking out more analogies of the game. I’ll create one or two future blog posts on those analogies, such as multi-threading and databases.