How to Create a New Parser for Apache Metron

This blog entry goes through the process of a Cyber Platform Operator creating a new parser for Apache Metron and everything you need to consider to make this process as smooth as possible. This can also be seen as a checklist or to-do list when you are creating a new parser.

Assumption: You know what Metron is, the data source is fully onboarded on your platform and the parser config is the only thing that’s missing. Here are the things you need to consider to onboard a new source.

In general, this article walks you through 3 phases:

  • Check if you can re-use an existing parser. If so, you’re done, the testing part of phase 2 still applies, though.
  • Build and test a protoype. Grok is your friend.
  • Write your parser in Java.

Phase 1: Check if you can use an existing parser

  • Get a sample set of your source to test with. The more diverse you expect the formats of the same source to be, the bigger your sample size should be. 20 should be ok to start with.
  • Check the format of the string.
    • If it is in JSON format, use the JSON parser!
    • If it’s a comma separated line, use the CSV parser!
    • Or generally: If it’s in a format of any of the included Metron parsers, use this parser: CEF, Lancope, PaloAltoFirewall, Sourcefire, Logstash, FireEye, Asa, Snort, JSONMap, Ise, GrokWebSphere, Bro,….
    • If it’s something else use the Grok parser!

Phase 2: Build and test a (Grok) prototype

In the rest of the article I assume that you don’t re-use one of the included parsers, which is why you want to create your own custom one. Thus, you leverage the Grok parser. However, the test setup described below and can be used for any kind of parser.

  • Use http://grokdebug.herokuapp.com/ to test one of your samples and start with adding  %{GREEDYDATA:message} and continuously add more precise parsing statements and check if it compiles. If you’re new to Grok start here: https://logz.io/blog/logstash-grok/.
  • Test all of your samples in the app to check if your Grok statement is general enough.
  • You also might want to append %{GREEDYDATA:suffix}(\n|\r|\r\n)?+ to catch any kind of additional data, as well as filter newline and optional carriage-return fields at the end of a line. That depends on how diverse or clean your data source is.
  • Configure and validate the parser in Metron Management UI using “Grok” as parser type and paste the grok statement in the field “Grok Statement”.
    • Attention: don’t forget to define the timestampField, the timeFields and the dateFormat. If you don’t specify those values, the parser validation will fail with an “error_type”: “parser_invalid”. The field configured as the timestampField will be converted into a timestamp parsed based on the inputs from the dateFormat field. Use the Joda time date format documented here.
    • When the datetime is correctly parsed, double-check if the calculated timestamp matches the input time. This online epochconverter comes handy.
    • Note: to consolidate your view of the data across many sources, make sure you name the source ip address “ip_src_addr”, your destination ip address “ip_dst_addr”, your source port “ip_src_port” and your destination port “ip_dst_port”.
    • Note: In general, every parser – not only the Grok parser – has their specific required/default parameters to be set. Read the parser docs to be sure to configure the parsers correctly. Below is an example of how the parserConfig part of your parser configuration file should look like. You configure this part in the Metron Management UI:
metron_managementui_parser.png
  • Double check:
    • If Grok statements are stored in the configured HDFS path: /apps/metron/patterns/mycustomparser
    • If the Zookeeper configuration is up to date: bin/zkCli.sh -server <zookeeper-quorum> get /metron/topology/parsers/mycustomparser.  Specifically look for the parserConfig part shown below.
{
  ...
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "parserConfig": {
    "grokPath": "/apps/metron/patterns/mycustomparser",
    "patternLabel": "MYCUSTOMPARSER",
    "timestampField": "datetime",
    "timeFields": ["datetime"],
    "dateFormat": "yyyy-MM-dd HH-mm-ss",
    "timezone": "UTC"
  },
  ...
}
metron_parser_test_setup.png
Metron parser test setup: 1. Consume from the parser topic (assuming you initially ingested your sample already) 2. Control your flow rate to release only 1 event per 5 seconds (or which ever speed you like) 3. Write back into the parser topic and check if the event is being processed correctly.
  • Ingest the messages into the Kafka topic using your NiFi test setup and check if they are successfully persisted in your desired collection.

Phase 3: Make your Metron parser production ready

Once you have your custom parser up and running and improved it continuously you want to create something more stable with higher performance than Grok statements. However, nothing is for free. You need to get your hands dirty in Java. Fortunately, it’s not a lot of dirt and it’s quite easy to write your own parser by extending the BasicParser class.

  • Check out this part of the documentation to get a walkthrough: 3rd party parsers
  • In this part of the documentation you’ll learn to:
    • Get to know which dependencies you need.
    • Implement a parser method of your custom parser class extending the BasicParser class.
    • Build the jar and deploy it in the extra-parser directory.
    • Restart Metron Rest service to pick up the new parser from your jar file.
    • Add your parser in the Metron Management UI by choosing your parser type.
    • Configure and start your parser.
  • Stop your interim Grok parser and start your custom Java parser.

Apache Metron as an Example for a Real Time Data Processing Pipeline

In my previous blog post I was writing a little bit about what Apache Metron is and How to Onboard a New Data Source in Apache Metron.

Now I want to shine some light on how the ingestion pipeline architecture looks like. Since I just got started with Apache Metron myself, I hope this helps to kickstart your cyber security efforts. Rather than going too much into the details of what the components do, I’d like to provide a basic overview about which components there are.

This architecture can be generalized for all kinds of streaming use cases. The pipeline uses Apache NiFi for ingest, Apache Kafka as an event buffer, Apache Storm for stream processing, Apache Hadoop for long term storage and Apache Solr for short term random access storage. If you design your own pipeline for a different use case, you can, e.g., swap Apache Storm with frameworks such as Apache Flink or Spark Streaming (or any other frameworks out in the wild with their pros and cons). Choosing the right piece of technology strongly depends on numerous factors, I’m not going into in this article.

metron_pipeline
End to End Processing Pipeline for Apache Metron

Ingest

The most important part for Apache Metron is to get the telemetry data into an Apache Kafka topic. In the figure below you can see that there is a Kafka topic and a corresponding parser for each format. Usually, there is one Kafka topic per source type, because each source typically comes in its own special format, but it’s also possible that data of one source has multiple formats or multiple sources have the same format.

metron_pipeline_ingest_closeup.png
  • Apache NiFi is being used as the data integration tool.
  • In the figure, I added an example of a MiNiFi instance to the Squid Access Log source. In this case MiNiFi is installed on the Squid server node and acts as a log forwarder.
  • It’s also possible that sources write directly into Kafka, if they support that. In some cases this might even be a requirement due to performance constraints.

Parsing

As described in the ingest part: there is a topic for each parser format and an Apache Storm topology reading from this Kafka topic and doing the parsing. A parsed event is then written into the so-called “enrichments” topic.

metron_pipeline_parsing_closeup
  • The parsing has two purposes:
    • it brings all ingest format into a JSON format.
    • it introduces a common set of fields shared among all data sources, as well as unique fields that are special to each source.
  • Some parsers of common formats are included in the Metron project.
  • If there is no parser (that works) for your format, you can use Grok to quickly prototype and launch your parser before you write it in Java.
  • It is also possible to launch parser chains to extract information that is convoluted in different formats.
  • You can also decide to run only one topology handling multiple parsers in a so-called aggregated parser. This can be combined with parser chains.

Enrichment

The purpose of the enrichment Storm topology is to pick up events from the enrichments topic and add information from external sources. The enriched output is written to an indexing topic.

metron_pipeline_enrichment_closeup
  • A typical enrichment is a lookup in a database to convert an IP address into geo information
  • The profiler uses sliding windows to create aggregates/statistics in certain time windows, so-called profiles.
  • These profiles can be used to enrich data.
  • Metron helps you use any data in HBase to enrich your events.

Persisting

There are two Storm topologies to read from the indexing topic that persist events, the batch indexing topology and the random access indexing topology. The first utilizes an HDFSBolt to write data to HDFS. The latter one indexes data in Apache Solr.

metron_pipeline_persisting_closeup
  • There is one Solr collection per data format.
    • This way the parsed fields and definitions are kept clean and separated.
    • Also, you can authorize different users and groups to different data sources. This is even easier with the Solr Plugin for Apache Ranger.
  • HDFS is used as long term storage for analytical purposes and to use the data to create machine learning models.
  • Solr is being used for direct fast random access and search capabilities, e.g. by the Metron Alerts UI. It makes sense to store the data for only a limited amount of time for performance reasons.
  • It’s quite easy to create a new collection. I’ve described it on this github gist. I’ve added properties in the solrconfig.xml to define a “time to live” for an event in Solr, after which the event will be deleted from the collection.
  • Instead of Solr, you can use Elastic Search.

Conclusion

I hope this can be useful for somebody, either trying to implement Metron or somebody interested in how modern streaming pipelines look like in general. If you have questions, don’t hesitate to ask the experts in the Metron mailing list (user@metron.apache.org) or get support from the Hortonworks Community.