Big Data and Stream Processing 101 – Part 1: Overview of Tools and Frameworks

While the number of tools in the open source Big Data and streaming ecosystem keeps growing, the frameworks that have been around for a long time have become highly mature and feature rich; some may say “enterprise ready”. Thus, it’s not surprising to me that many of my customers who are new to the ecosystem struggle to understand the basics of each of these tools. The first question is always, “When do I use which tool?”, but the answer rarely sticks without having seen a certain tool in action.

This, and a tweet that I recently stumbled upon, was motivation enough for me to explain the most basic things you can do with these tools. Each future blog post will cover exactly one tool: a description and detailed explanation of its most basic operation, and only that operation – none of the advanced features that might confuse beginners.

One motivation for me to write this blog post.

In this first blog post of the series, I want to categorise the tools – as well as possible – and describe each of them in as few words as possible, ideally in less than a full sentence. I’ll then introduce each of these tools, in arbitrary order, in subsequent blog posts. This should help anybody get started and lower the barrier to entry for each of these tools; from there, trainings or self-study can fill in all the features that are supposed to make our lives easier when processing and managing data.

Disclaimer: this overview is highly opinionated, most probably biased by my own experience and definitely incomplete ( = not exhaustive). I’m definitely up for discussion and open to questions on why I put a certain tool into a certain category and why certain categories are named the way they are. So don’t hesitate to reach out to me 🙂

Tools and Frameworks

I’m re-using the slides that I recently created for a “lunch and learn” session. You’ll notice that many, if not all, of the tools appear in multiple categories.

Round parentheses ( ) mean that I hesitated and spent some time considering whether I really should put the tool into a certain category, because strictly speaking it doesn’t fit. I probably put it there because it *can* be used, or is often *used in combination*, with tools in this category.

Square brackets [ ] mean that the tool is not very popular anymore. It might still be supported, widely used and mature, but it is simply no longer popular and is likely to fade away and be replaced by another tool.

All – Categorised by Function

“All” doesn’t mean every past and future tool in the ecosystem. “All” in this article means just the tools that I consider and that are available in one of the distributions HDP, CDH or the new Cloudera Data Platform (CDP).

Note: “Technical Frameworks” are not frameworks you’d work with on a daily basis, or at all. They’re just there, enabling the rest of the cluster to work properly or enabling certain features. All of the frameworks/tools/projects in this category are very different from each other.

Processing – Categorised by Speed of Data

Here, “Data at Rest” means that the data may well be old, historic data, while “Streaming Data” refers to event-based/stream processing: processing data while it’s on its way from its creation at the source to its final destination. That final destination could itself be a “Data at Rest” persistence engine/database.

Databases – Categorised by Latency

Latency here could refer to two different things:

  • How up-to-date the data in the database is
  • How long a query to the database takes to respond with the results

I don’t distinguish between the two in this categorisation; doing so would make this exercise a bit too detailed and tedious. Generally, it’s important to consider both in order to choose an adequate database for a certain use case.

All – Categorised by Use Case

I chose four typical use cases for this categorisation. Many other use cases can, of course, be realised with these tools.

List of All Tools and Frameworks

Again, “All” doesn’t mean all tools currently available in the open source big data ecosystem. “All” means the bulletproof, tested, compatible set of components that easily cover the most common Big Data and Streaming use cases.

Apache NiFi: Manage data flows; get data from A to B and process it along the way, all through a UI

Apache Spark: Use dataframes to extract, transform and load data; train and evaluate ML models
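
To give a first taste before Spark gets its own post, here is a minimal sketch of a dataframe-based ETL job in PySpark. The file paths and column names ("events.csv", "status", "timestamp") are hypothetical placeholders, not from any real dataset:

```python
# Minimal PySpark ETL sketch: read CSV, transform, write Parquet.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("basic-etl").getOrCreate()

# Extract: load raw data into a dataframe
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Transform: keep successful events and derive a date column
clean = (df.filter(F.col("status") == "ok")
           .withColumn("day", F.to_date(F.col("timestamp"))))

# Load: persist the result as Parquet
clean.write.mode("overwrite").parquet("events_clean")
```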

Apache Kafka: Publish, persist and subscribe to events
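
Again just a taste: a minimal publish/subscribe sketch using the kafka-python package, one of several available Python clients. The broker address and topic name are placeholders:

```python
# Minimal Kafka publish/subscribe sketch with the kafka-python package.
# Broker address and topic name are hypothetical placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello, stream")  # publish an event
producer.flush()                           # make sure it is sent

consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:                    # subscribe and read events
    print(record.value)
    break
```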

Cloudera Data Science Workbench: Explore data; use Python and Spark; deploy and manage ML models

Apache Hive: Data Warehousing engine; SQL on distributed/object storage
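
As a sketch of what “SQL on distributed storage” can look like from Python, assuming the PyHive package and a HiveServer2 endpoint on its default port; host and table name are placeholders:

```python
# Minimal Hive query sketch via PyHive and HiveServer2.
# Host, port and table name are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM web_logs")  # SQL on distributed storage
print(cursor.fetchone())
```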

Apache Impala: Data Hub engine; SQL on distributed/object storage

Apache Kudu: Quickly analyse fast data

Apache Oozie: Schedule jobs

Apache Sqoop: Transfer data from and to relational databases

Apache Druid: Analyse realtime data with high performance

Apache Solr: Index and search text

Apache Flink: Process streams of data using a programming interface

Apache Storm: Process streams of data using a programming interface

Apache HBase: Consistent, scalable NoSQL database; low-latency look-ups; unstructured data

Apache Phoenix: Do SQL on top of HBase
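
A minimal sketch of “SQL on top of HBase”, assuming the phoenixdb Python package and a Phoenix Query Server running on its default port; the table and values are placeholders:

```python
# Minimal Apache Phoenix sketch via the phoenixdb package
# (a DB-API client that talks to the Phoenix Query Server).
# URL, table and values are hypothetical placeholders.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)")
cursor.execute("UPSERT INTO users VALUES (1, 'alice')")  # Phoenix upsert semantics
cursor.execute("SELECT id, name FROM users")
print(cursor.fetchall())
```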

Apache Zeppelin: Notebook application tightly integrated with Hive and Spark

Apache Ranger: Define policies to permit or restrict access to data

Apache Knox: Manage access to services

Apache Atlas: Manage metadata

Part 2: How to Create a Simple Ranger Policy

Meet the Hadoop User Group Vienna

A friend said, “Vienna needs a Hadoop User Group”, and I agreed with him. The next step was to start a Meetup group. Meetup is a platform where anyone can organize any kind of meeting on any kind of topic. Hadoop has only recently started to gain a little traction in Austria and Vienna, and I think it’s the perfect time to start a group like this.

This group is for everyone in Vienna using Apache Hadoop, at any level of skill. The focus of the group is clearly technical, with an eye on use cases. I’ll try to organize technical talks by Hadoop-related vendors for the sessions. I also want to create opportunities to work together on real-world problems and get hands-on with Hadoop. In this group we will create a network of Hadoop users, discuss recent and interesting (technical) topics, eat, drink and – most importantly – have fun together.

I’d like the group to be interactive, with everyone having the opportunity to contribute.

For the first Meetup on Wednesday, May 18, I plan to briefly introduce the goals of the group. I believe all members should brainstorm together on what we all expect of the group in the future, and try to figure out how often we should meet and which topics we want to work on.

My ideas on how it could look in the future:

  • One of us could provide some code and walk the others through it. That way the more experienced among us can provide feedback and hints on what to improve, and the less experienced gain knowledge.
  • We could define a project to work on together: e.g., building a Hadoop cluster out of Raspberry Pis, writing streaming applications in Apache Storm or Apache Spark, or whatever you want…
  • I plan to combine the Meetup every now and then with the Vienna Kaggle Meetup and do a session about “Data Science and Hadoop”.
  • As with the Vienna Kaggle group, I have created a Git organisation for code that we work on together. If you are interested in joining, just contact me and I will give you access.

I am looking forward to getting to know you as well as hearing your ideas on what to contribute to the group.


My Impressions of the Hadoop Summit Dublin 2016

The Hadoop Summit is a tech conference hosted by Hortonworks, one of the biggest Apache Hadoop distributors, and Yahoo, the company in which Hadoop was born. Software developers, consultants, business owners and administrators who share an interest in Hadoop and the technologies of its ecosystem all gathered in Dublin, where this year’s European Hadoop Summit took place. The Hadoop Summit 2016 Dublin had some great keynotes, plenty of time to network and a lot of exciting talks about bleeding-edge technology, its use cases and success stories. It was also a great opportunity for companies working with Hadoop to present themselves and for the visitors to get to know them.

Keynote: “Data is Beautiful”

The organisation of the conference was great. 1,300 people participated, but it never felt crowded, nor were there any (big) queues to enter the speaker rooms or at the lunch buffet.

My Favorite Talks

This is a list of my favorite talks in chronological order, with their videos embedded. To be honest, this list covers almost all of the talks that I saw in person, and I probably missed even more great ones given in parallel. Fortunately, we can watch all of them on the official Hadoop Summit 2016 Dublin YouTube channel.

  • SQL streaming: This talk gave a really nice overview of the development of a SQL streaming solution, with all its technical challenges and how they were addressed. Simple technical use cases were also discussed, and streaming SQL was compared to traditional SQL: a traditional query terminates, whereas a streaming SQL query never does.

  • Hadoop at LinkedIn: Here we got valuable insights into LinkedIn’s Hadoop landscape, as well as into their job monitoring and automated health checks. Dr. Elephant, a job monitoring tool developed by LinkedIn, was open-sourced only a few days before the start of the Summit.

  • Containerization at Spotify: This talk was about how Spotify uses Docker containers and the tools involved in their automated IT landscape. The best part starts at 39:30, where it is revealed that Spotify overcomes security challenges by not implementing internal security measures at all. According to the speaker, everyone can access everyone’s data. If life could always be as simple as that 🙂

  • Apache Zeppelin + Apache Livy: Apache Zeppelin is already a great tool for interactive data analysis and exploration, for ETL tasks using Apache Pig, for querying data with Apache Hive, and for executing Python, R or Bash scripts. Apache Livy helps data scientists work together in one notebook on a secured cluster. What I like a lot about this talk is that the speakers explain the authentication mechanism involved very nicely.

  • Apache Phoenix: Apache Phoenix is a SQL query engine on top of Apache HBase, and much more. This talk was basically an overview of the capabilities and features of Apache Phoenix. Great stuff – nothing more to add. Watch the video!

10 Years of Hadoop Party

On the night of day one, the Guinness Storehouse hosted a huge burger, beer and big data networking event. As you can imagine, there was good food, Guinness, great music by Irish bands on several floors and, most importantly, the same cool people who attended the conference.

The author in the Guinness Storehouse

Summary

My first Hadoop Summit was a great experience in every respect. I made great contacts, gained lots of knowledge and had lots of fun at the same time. Hopefully, I will be able to attend the next Hadoop Summit in Munich in 2017.