Now I want to shine some light on what the ingestion pipeline architecture looks like. Since I just got started with Apache Metron myself, I hope this helps to kickstart your cyber security efforts. Rather than going into detail about what each component does, I’d like to provide a basic overview of which components there are. This architecture can be generalized to all kinds of streaming use cases.
The most important part for Apache Metron is to get the telemetry data into an Apache Kafka topic. In the figure below you can see that there is a Kafka topic and a corresponding parser for each format. Usually, there is one Kafka topic per source type, because each source typically comes in its own special format, but it’s also possible that data from one source has multiple formats or that multiple sources share the same format.
- Apache NiFi is used as the data integration tool.
- In the figure, I added an example of a MiNiFi instance to the Squid Access Log source. In this case MiNiFi is installed on the Squid server node and acts as a log forwarder.
- It’s also possible that sources write directly into Kafka, if they support that. In some cases this might even be a requirement due to performance constraints.
As described in the ingest part, there is a Kafka topic for each parser format, and an Apache Storm topology reads from that topic and does the parsing. Each parsed event is then written to the so-called “enrichments” topic.
- The parsing has two purposes:
- it converts every ingest format into JSON.
- it introduces a common set of fields shared among all data sources, as well as unique fields that are special to each source.
- Some parsers of common formats are included in the Metron project.
- If there is no working parser for your format, you can use Grok to quickly prototype and launch your parser before you write it in Java.
- It is also possible to launch parser chains to extract information that is wrapped in several nested formats.
- You can also decide to run only one topology handling multiple parsers in a so-called aggregated parser. This can be combined with parser chains.
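To make the two purposes of parsing more concrete, here is a minimal Python sketch of what a parser does with a single Squid access log line. It is not Metron's actual Squid parser (those are written in Java or Grok). The regular expression assumes Squid's default native log layout, the function name `parse_squid` is made up for this example, and the common field names (`ip_src_addr`, `ip_dst_addr`, `timestamp`, `source.type`, `original_string`) follow the naming I have seen in Metron, so treat them as an assumption.

```python
import json
import re

# Assumed layout of Squid's default native access.log format:
# time elapsed client action/code bytes method url ident hierarchy/peer type
SQUID_LINE = re.compile(
    r"^(?P<sec>\d+)\.(?P<ms>\d{3})\s+(?P<elapsed>\d+)\s+(?P<client>\S+)\s+"
    r"(?P<action>[A-Z_]+)/(?P<code>\d{3})\s+(?P<bytes>\d+)\s+"
    r"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<ident>\S+)\s+"
    r"(?P<hierarchy>[A-Z_]+)/(?P<peer>\S+)\s+(?P<content_type>\S+)$"
)


def parse_squid(line: str) -> dict:
    """Turn one raw Squid log line into a flat, JSON-serializable event."""
    raw = line.strip()
    m = SQUID_LINE.match(raw)
    if m is None:
        raise ValueError(f"not a Squid access log line: {raw!r}")
    return {
        # Common fields shared by all sources (names assumed, see lead-in).
        "original_string": raw,
        "timestamp": int(m["sec"]) * 1000 + int(m["ms"]),  # epoch millis
        "source.type": "squid",
        "ip_src_addr": m["client"],
        "ip_dst_addr": m["peer"],
        # Fields that only make sense for this source.
        "elapsed": int(m["elapsed"]),
        "action": m["action"],
        "code": int(m["code"]),
        "bytes": int(m["bytes"]),
        "method": m["method"],
        "url": m["url"],
    }


if __name__ == "__main__":
    sample = (
        "1475022887.362    256 127.0.0.1 TCP_MISS/301 803 GET "
        "http://www.youtube.com/ - DIRECT/216.58.216.238 text/html"
    )
    # In the real pipeline this JSON would go to the "enrichments" topic.
    print(json.dumps(parse_squid(sample), indent=2))
```

The split between "common" and "source-specific" keys is the point here: everything downstream can rely on the common fields without knowing which source the event came from.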
The purpose of the enrichment Storm topology is to pick up events from the enrichments topic and add information from external sources. The enriched output is written to an indexing topic.
- A typical enrichment is a database lookup that converts an IP address into geo information.
- The profiler uses sliding windows to create aggregates/statistics in certain time windows, so-called profiles.
- These profiles can be used to enrich data.
- Metron helps you use any data in HBase to enrich your events.
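The geo lookup mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Metron implements it: the in-memory `GEO_TABLE` stands in for an external geo database or an HBase table, its entries are made-up values on documentation IP ranges, and the `enrichments.geo.<field>.<key>` naming of the added fields is my assumption about how enriched fields are labeled.

```python
import ipaddress
from typing import Optional

# Hypothetical stand-in for an external geo database (made-up values on
# IP ranges reserved for documentation).
GEO_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): {"country": "AU", "city": "Sydney"},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "DE", "city": "Berlin"},
}


def geo_lookup(ip: str) -> Optional[dict]:
    """Return geo information for an IP address, or None if unknown."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return None  # not an IP address at all, e.g. "-"
    for network, geo in GEO_TABLE.items():
        if address in network:
            return geo
    return None


def enrich(event: dict) -> dict:
    """Return a copy of the parsed event with geo fields added."""
    enriched = dict(event)
    for field in ("ip_src_addr", "ip_dst_addr"):
        geo = geo_lookup(event.get(field, ""))
        if geo is None:
            continue
        for key, value in geo.items():
            # Field naming is an assumption, see lead-in.
            enriched[f"enrichments.geo.{field}.{key}"] = value
    return enriched


if __name__ == "__main__":
    parsed = {
        "source.type": "squid",
        "ip_src_addr": "192.0.2.10",
        "ip_dst_addr": "203.0.113.7",
    }
    # In the real pipeline the result would go to the indexing topic.
    print(enrich(parsed))
```

Note that the event keeps all of its parsed fields and only gains new ones, so an event for which no lookup matches passes through unchanged.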
Two Storm topologies read from the indexing topic and persist the events: the batch indexing topology and the random access indexing topology. The former uses an HDFSBolt to write data to HDFS. The latter indexes data in Apache Solr.
- There is one Solr collection per data format.
- This way the parsed fields and definitions are kept clean and separated.
- Also, you can grant different users and groups access to different data sources. This is even easier with the Solr Plugin for Apache Ranger.
- HDFS is used as long-term storage for analytical purposes and as the data basis for creating machine learning models.
- Solr is used for fast random access and search capabilities, e.g. by the Metron Alerts UI. For performance reasons, it makes sense to keep the data there for only a limited amount of time.
- It’s quite easy to create a new collection. I’ve described it in this GitHub gist. I’ve added properties in the solrconfig.xml to define a “time to live” for an event in Solr, after which the event will be deleted from the collection.
- Instead of Solr, you can use Elasticsearch.
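The division of labor between the two stores can be modeled with a small Python sketch. It is a toy model under my own assumptions, not Metron code: `BatchSink` stands in for HDFS, `RandomAccessSink` stands in for Solr with one collection per source type, and the `expire` method mimics the effect of a “time to live” setting. In Metron the two topologies read from the indexing topic independently, whereas the `index` helper here simply writes to both sinks in one loop.

```python
import json
from collections import defaultdict


class BatchSink:
    """Append-only long-term store, standing in for HDFS."""

    def __init__(self):
        self.files = defaultdict(list)  # source type -> JSON lines

    def write(self, event: dict) -> None:
        self.files[event["source.type"]].append(json.dumps(event, sort_keys=True))


class RandomAccessSink:
    """Short-term searchable store, standing in for Solr."""

    def __init__(self, ttl_ms: int):
        self.ttl_ms = ttl_ms
        self.collections = defaultdict(list)  # one collection per source type

    def write(self, event: dict) -> None:
        self.collections[event["source.type"]].append(event)

    def expire(self, now_ms: int) -> None:
        """Drop every event whose age has reached the time to live."""
        for name in list(self.collections):
            self.collections[name] = [
                doc
                for doc in self.collections[name]
                if now_ms - doc["timestamp"] < self.ttl_ms
            ]


def index(events, batch: BatchSink, random_access: RandomAccessSink) -> None:
    """Persist every event in both stores."""
    for event in events:
        batch.write(event)
        random_access.write(event)


if __name__ == "__main__":
    batch, search = BatchSink(), RandomAccessSink(ttl_ms=5_000)
    index(
        [
            {"source.type": "squid", "timestamp": 0, "url": "http://a.example/"},
            {"source.type": "squid", "timestamp": 10_000, "url": "http://b.example/"},
        ],
        batch,
        search,
    )
    search.expire(now_ms=12_000)
    print(len(batch.files["squid"]), len(search.collections["squid"]))
```

After the expiry run, the long-term store still holds both events while the searchable store only holds the recent one, which is the trade-off described in the bullets above.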
I hope this clears a few things up and is useful, whether you are trying to implement Metron or are interested in what modern streaming pipelines look like in general. If you have questions, don’t hesitate to ask the experts on the Metron mailing list (firstname.lastname@example.org) or get support from the Hortonworks Community.