Apache Metron as an Example for a Real Time Data Processing Pipeline

In my previous blog post, I wrote a little bit about what Apache Metron is and how to onboard a new data source in Apache Metron.

Now I want to shed some light on what the ingestion pipeline architecture looks like. Since I only just got started with Apache Metron myself, I hope this helps to kickstart your cyber security efforts. Rather than going too deep into the details of what the components do, I’d like to provide a basic overview of which components there are.

This architecture can be generalized for all kinds of streaming use cases. The pipeline uses Apache NiFi for ingest, Apache Kafka as an event buffer, Apache Storm for stream processing, Apache Hadoop for long-term storage, and Apache Solr for short-term random access storage. If you design your own pipeline for a different use case, you can, e.g., swap Apache Storm for frameworks such as Apache Flink or Spark Streaming (or any other framework out in the wild, each with its own pros and cons). Choosing the right piece of technology strongly depends on numerous factors that I’m not going into in this article.

[Figure: End-to-end processing pipeline for Apache Metron]

Ingest

The most important part for Apache Metron is to get the telemetry data into an Apache Kafka topic. In the figure below you can see that there is a Kafka topic and a corresponding parser for each format. Usually there is one Kafka topic per source type, because each source typically comes in its own special format, but it’s also possible that one source’s data comes in multiple formats or that multiple sources share the same format.

[Figure: Close-up of the ingest part of the pipeline]

  • Apache NiFi is used as the data integration tool.
  • In the figure, I added an example of a MiNiFi instance to the Squid Access Log source. In this case MiNiFi is installed on the Squid server node and acts as a log forwarder.
  • It’s also possible for sources to write directly into Kafka, if they support that. In some cases this might even be a requirement due to performance constraints (a minimal sketch of this option follows below).
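
To make the direct-to-Kafka option a bit more concrete, here is a minimal Python sketch of a log forwarder that pushes Squid access log lines into a “squid” topic. This is not part of Metron; the kafka-python package, the broker address, and the log path are assumptions you would adapt to your environment.

```python
# Minimal sketch: forward Squid access log lines straight into Kafka.
# Assumptions: the kafka-python package is installed, a broker listens on
# localhost:6667, and the target topic "squid" already exists.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:6667")

with open("/var/log/squid/access.log", "rb") as log:
    for line in log:
        line = line.strip()
        if line:
            # One Kafka message per raw log line; the parser topology
            # turns it into a JSON event later on.
            producer.send("squid", line)

producer.flush()
```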

Parsing

As described in the ingest section, there is a topic for each parser format and an Apache Storm topology that reads from this Kafka topic and does the parsing. A parsed event is then written to the so-called “enrichments” topic.

[Figure: Close-up of the parsing part of the pipeline]

  • The parsing has two purposes:
    • it brings all ingest formats into a common JSON format.
    • it introduces a common set of fields shared among all data sources, as well as fields that are unique to each source.
  • Some parsers of common formats are included in the Metron project.
  • If there is no parser (that works) for your format, you can use Grok to quickly prototype and launch your parser before you write it in Java (see the sketch after this list).
  • It is also possible to launch parser chains to extract information from messages that wrap one format inside another.
  • You can also decide to run only one topology handling multiple parsers in a so-called aggregated parser. This can be combined with parser chains.
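
To illustrate the Grok prototyping idea, here is a small Python sketch using the third-party pygrok package (not part of Metron) with a pattern modeled on the Squid example from the Metron documentation; the exact pattern and field names may differ in your deployment.

```python
# Sketch only: prototype a Grok pattern for Squid access logs outside of Metron.
# Assumptions: the third-party "pygrok" package is installed; the pattern is
# modeled on the Squid example in the Metron docs and may need adjustment.
from pygrok import Grok

SQUID_PATTERN = (
    "%{NUMBER:timestamp} %{INT:elapsed} %{IP:ip_src_addr} "
    "%{WORD:action}/%{NUMBER:code} %{NUMBER:bytes} %{WORD:method} %{NOTSPACE:url}"
)

sample = "1528766618.837 415 127.0.0.1 TCP_MISS/200 337891 GET http://example.com/"

parsed = Grok(SQUID_PATTERN).match(sample)

# The result is a plain dict, similar in spirit to the JSON event the parser
# topology emits: common fields such as timestamp and ip_src_addr plus
# source-specific ones such as action and code.
print(parsed)
```

Once the pattern works on real samples, the same pattern can be used by Metron’s Grok parser or serve as a specification for a Java parser.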

Enrichment

The purpose of the enrichment Storm topology is to pick up events from the enrichments topic and add information from external sources. The enriched output is written to an indexing topic.

[Figure: Close-up of the enrichment part of the pipeline]

  • A typical enrichment is a lookup in a database to convert an IP address into geo information (a conceptual sketch follows after this list).
  • The Profiler uses sliding windows to compute aggregates/statistics over time, so-called profiles.
  • These profiles can be used to enrich data.
  • Metron helps you use any data in HBase to enrich your events.
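
To make the geo lookup a bit more concrete, here is a conceptual Python sketch of what such an enrichment does. Metron performs this inside the enrichment Storm topology (in Java, backed by a MaxMind database), so this is only an illustration; the geoip2 package, the database file, and the output field names are assumptions.

```python
# Conceptual sketch of a geo enrichment, not Metron's actual implementation
# (Metron does this inside the enrichment Storm topology).
# Assumptions: the third-party "geoip2" package and a GeoLite2-City.mmdb file;
# the enrichment field names below are illustrative only.
import geoip2.database
import geoip2.errors

def enrich_with_geo(event, reader):
    """Add geo fields for the source IP of a parsed event, if possible."""
    ip = event.get("ip_src_addr")
    if not ip:
        return event
    try:
        response = reader.city(ip)
    except geoip2.errors.AddressNotFoundError:
        return event
    event["geo.ip_src_addr.country"] = response.country.iso_code
    event["geo.ip_src_addr.city"] = response.city.name
    event["geo.ip_src_addr.latitude"] = response.location.latitude
    event["geo.ip_src_addr.longitude"] = response.location.longitude
    return event

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    event = {"ip_src_addr": "8.8.8.8", "source.type": "squid"}
    print(enrich_with_geo(event, reader))
```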

Persisting

There are two Storm topologies that read from the indexing topic and persist events: the batch indexing topology and the random access indexing topology. The former uses an HDFSBolt to write data to HDFS; the latter indexes data in Apache Solr.

[Figure: Close-up of the persisting part of the pipeline]

  • There is one Solr collection per data format.
    • This way the parsed fields and definitions are kept clean and separated.
    • Also, you can authorize different users and groups for different data sources. This is even easier with the Solr plugin for Apache Ranger.
  • HDFS is used as long-term storage for analytical purposes and for building machine learning models from the data.
  • Solr is used for fast random access and search capabilities, e.g. by the Metron Alerts UI (a small query sketch follows after this list). For performance reasons, it makes sense to keep the data there for only a limited amount of time.
  • It’s quite easy to create a new collection; I’ve described it in this GitHub gist. I’ve added properties in the solrconfig.xml to define a “time to live” for an event in Solr, after which the event is deleted from the collection.
  • Instead of Solr, you can use Elasticsearch.
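
As a small illustration of the random access side, a search like the ones the Alerts UI performs can be sketched as a plain HTTP query against the Solr collection. Hostname, port, and field names are assumptions and will differ per deployment.

```python
# Sketch: query the "squid" Solr collection for the most recent events,
# roughly what the Metron Alerts UI does behind the scenes.
# Assumptions: the "requests" package, Solr reachable on localhost:8983,
# and a "timestamp" field in the collection (field names vary per deployment).
import requests

params = {
    "q": "*:*",                # match everything...
    "sort": "timestamp desc",  # ...newest events first
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/squid/select", params=params)
resp.raise_for_status()

for doc in resp.json()["response"]["docs"]:
    print(doc.get("timestamp"), doc.get("ip_src_addr"), doc.get("url"))
```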

Conclusion

I hope this is useful for somebody trying to implement Metron, or for anybody interested in what modern streaming pipelines look like in general. If you have questions, don’t hesitate to ask the experts on the Metron mailing list (user@metron.apache.org) or get support from the Hortonworks Community.

How to Onboard a New Data Source in Apache Metron

Introduction

Apache Metron aims to be a tool for analysts in a cyber security team, helping them define intelligent alerts, detect threats, and work on them in real time. This is the first blog post in a series meant to ease operations and share my experiences with Apache Metron. Thus, it serves as an introduction to Metron.

[Figure: Apache Metron logo]

Technical Introduction

Apache Metron is a cyber security platform that makes heavy use of the Hadoop ecosystem to create a scalable and available solution. It utilizes Apache Storm and Apache Kafka to parse, enrich, profile, and eventually index data from telemetry sources, such as network traffic, firewall logs, or application logs, in real time. Apache Solr or Elasticsearch is used for random access searches, while Apache Hadoop HDFS is used for long-term and analytical storage. Metron comes with its own scripting language, “Stellar”, to query, transform and enrich data. A security operator/analyst uses the Metron Management UI to configure and manage input sources, as well as the Metron Alerts UI to search, filter and group events.

[Screenshot: Metron Alerts UI, showing a few dummy events from a Squid log]

Scope of this Post

Since virtually every data source can be used to generate events, it is natural that the platform operator/analyst will want to add data from new sources over time. I use this post as a small checklist to document considerations for the “onboarding” process of new data sources. You might want to automate this process in a way that works for you. In future posts I will cover the steps in detail.

Onboard a New Data Source

I need to ingest data into Kafka

  • It’s very handy to use Apache NiFi for the ingest part. Just create a data flow consisting of two processors: a simple TCP listener to receive data and a Kafka producer to push the events further into Kafka.
  • I can also push data directly into Kafka if the architecture, firewall, and the source system allow it.
  • If there are no active components on the source system pushing data, I might want to install an instance of MiNiFi on my source system.

[Screenshot: Simple example of a data ingest into Kafka via NiFi]

Before I can ingest data into Kafka, I need a new Kafka topic

  • While the “enrichments” and “indexing” Kafka topics are used by all data sources, the parser topics are specific to a data source.
  • I create a topic named “squid” with a number of partitions that corresponds to the amount of data I receive.
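
The Metron documentation typically shows this step with the kafka-topics.sh command line tool; as a rough equivalent, here is a sketch using the kafka-python admin client. The broker address, partition count, and replication factor are placeholders to adapt to your cluster.

```python
# Sketch: create the "squid" parser topic programmatically.
# Assumptions: kafka-python is installed and a broker listens on localhost:6667;
# partition and replication counts are placeholders, sized to your data volume.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:6667")
admin.create_topics([NewTopic(name="squid", num_partitions=3, replication_factor=1)])
admin.close()
```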

To make the events searchable, i.e., to store the events in Apache Solr, I need to create a new Solr collection

  • For each parser Storm topology and parser Kafka topic, there is a parser Solr collection.
  • I add a few fields common to all Metron Solr collections and optionally define data-source-specific fields in the schema.xml.
  • I create a new collection named “squid” with a number of shards that corresponds to the amount of data I receive.
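
As a rough sketch of that step, the collection can be created through Solr’s Collections API; the host, port, shard count, and the name of the uploaded config set are assumptions here.

```python
# Sketch: create the "squid" collection via Solr's Collections API.
# Assumptions: the "requests" package, Solr on localhost:8983, and a config set
# named "squid" (containing the schema.xml / solrconfig.xml mentioned above)
# already uploaded to ZooKeeper.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "squid",
        "numShards": 2,            # size this to the amount of data you receive
        "replicationFactor": 1,
        "collection.configName": "squid",
    },
)
resp.raise_for_status()
print(resp.json())
```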

I define my parser in the Metron Management UI

  • I click the “+” button in the right bottom corner of the Metron Management UI.
  • I configure my parser by choosing a Java class and/or defining a Grok pattern, inserting a sample, and checking whether the parsed output is what I expect.
  • I configure the parser settings: Kafka topic name, Solr collection name, parser config, enrichment definitions, threat intel logic, transformations, and parallelism (a sketch of the resulting configuration follows after this list).
  • I save the parser configuration and press the “Play” button next to the new parser to start it.
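
To give a feel for what the Management UI manages behind the scenes, here is a rough sketch of a Grok-based sensor parser configuration, written out as a Python dict. The keys are modeled on the Squid example in the Metron documentation; the paths, topic name, and other values are placeholders and may differ between Metron versions.

```python
# Rough sketch of the sensor parser configuration behind the Management UI
# for a Grok-based Squid parser. Keys follow the Metron docs; the grok path,
# topic name, and other values are placeholders.
import json

squid_parser_config = {
    "parserClassName": "org.apache.metron.parsers.GrokParser",
    "sensorTopic": "squid",                 # the parser Kafka topic
    "parserConfig": {
        "grokPath": "/patterns/squid",      # HDFS path to the Grok pattern file
        "patternLabel": "SQUID_DELIMITED",  # pattern name inside that file
        "timestampField": "timestamp",
    },
    "fieldTransformations": [],             # optional Stellar transformations
}

print(json.dumps(squid_parser_config, indent=2))
```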

[Screenshot: Metron Management UI with my configured parsers. Currently only the Squid parser is running, which produces the events in the first screenshot.]


Outlook

I hope this post was helpful and informative. For questions, I refer you to the documentation, future posts, and the Metron mailing list, or you can post a question below.