How to Create a New Parser for Apache Metron

This blog entry goes through the process of a Cyber Platform Operator creating a new parser for Apache Metron and everything you need to consider to make this process as smooth as possible. This can also be seen as a checklist or to-do list when you are creating a new parser.

Assumption: You know what Metron is, the data source is fully onboarded on your platform and the parser config is the only thing that’s missing. Here are the things you need to consider to onboard a new source.

In general, this article walks you through 3 phases:

  • Check if you can re-use an existing parser. If so, you’re done, the testing part of phase 2 still applies, though.
  • Build and test a protoype. Grok is your friend.
  • Write your parser in Java.

Phase 1: Check if you can use an existing parser

  • Get a sample set of your source to test with. The more diverse you expect the formats of the same source to be, the bigger your sample size should be. 20 should be ok to start with.
  • Check the format of the string.
    • If it is in JSON format, use the JSON parser!
    • If it’s a comma separated line, use the CSV parser!
    • Or generally: If it’s in a format of any of the included Metron parsers, use this parser: CEF, Lancope, PaloAltoFirewall, Sourcefire, Logstash, FireEye, Asa, Snort, JSONMap, Ise, GrokWebSphere, Bro,….
    • If it’s something else use the Grok parser!

Phase 2: Build and test a (Grok) prototype

In the rest of the article I assume that you don’t re-use one of the included parsers, which is why you want to create your own custom one. Thus, you leverage the Grok parser. However, the test setup described below and can be used for any kind of parser.

  • Use http://grokdebug.herokuapp.com/ to test one of your samples and start with adding  %{GREEDYDATA:message} and continuously add more precise parsing statements and check if it compiles. If you’re new to Grok start here: https://logz.io/blog/logstash-grok/.
  • Test all of your samples in the app to check if your Grok statement is general enough.
  • You also might want to append %{GREEDYDATA:suffix}(\n|\r|\r\n)?+ to catch any kind of additional data, as well as filter newline and optional carriage-return fields at the end of a line. That depends on how diverse or clean your data source is.
  • Configure and validate the parser in Metron Management UI using “Grok” as parser type and paste the grok statement in the field “Grok Statement”.
    • Attention: don’t forget to define the timestampField, the timeFields and the dateFormat. If you don’t specify those values, the parser validation will fail with an “error_type”: “parser_invalid”. The field configured as the timestampField will be converted into a timestamp parsed based on the inputs from the dateFormat field. Use the Joda time date format documented here.
    • When the datetime is correctly parsed, double-check if the calculated timestamp matches the input time. This online epochconverter comes handy.
    • Note: to consolidate your view of the data across many sources, make sure you name the source ip address “ip_src_addr”, your destination ip address “ip_dst_addr”, your source port “ip_src_port” and your destination port “ip_dst_port”.
    • Note: In general, every parser – not only the Grok parser – has their specific required/default parameters to be set. Read the parser docs to be sure to configure the parsers correctly. Below is an example of how the parserConfig part of your parser configuration file should look like. You configure this part in the Metron Management UI:

metron_managementui_parser.png

  • Double check:
    • If Grok statements are stored in the configured HDFS path: /apps/metron/patterns/mycustomparser
    • If the Zookeeper configuration is up to date: bin/zkCli.sh -server <zookeeper-quorum> get /metron/topology/parsers/mycustomparser.  Specifically look for the parserConfig part shown below.
{
...
"parserClassName": "org.apache.metron.parsers.GrokParser",
"parserConfig": {
    "grokPath": "/apps/metron/patterns/mycustomparser",
    "patternLabel": "MYCUSTOMPARSER",
    "timestampField": "datetime",
    "timeFields": ["datetime"],
    "dateFormat": "yyyy-MM-dd HH-mm-ss",
    "timezone": "UTC"
},
...
}
  • Start the parser topology, e.g., by clicking on the “Play” button in the Metron Management UI.
  • Note:
    • When you change the pattern or the parser config, a restart of the topology should not be required. Double-check if the fields are correct by watching the enrichments topic or the indexing topics with the Kafka console consumer.
  • Build a little test setup in NiFi to be able to ingest and test the messages over and over. On a test cluster start this test setup whenever you perform changes to your Grok parser.
metron_parser_test_setup.png

Metron parser test setup: 1. Consume from the parser topic (assuming you initially ingested your sample already) 2. Control your flow rate to release only 1 event per 5 seconds (or which ever speed you like) 3. Write back into the parser topic and check if the event is being processed correctly.

  • Ingest the messages into the Kafka topic using your NiFi test setup and check if they are successfully persisted in your desired collection.

Phase 3: Make your Metron parser production ready

Once you have your custom parser up and running and improved it continuously you want to create something more stable with higher performance than Grok statements. However, nothing is for free. You need to get your hands dirty in Java. Fortunately, it’s not a lot of dirt and it’s quite easy to write your own parser by extending the BasicParser class.

  • Check out this part of the documentation to get a walkthrough: 3rd party parsers
  • In this part of the documentation you’ll learn to:
    • Get to know which dependencies you need.
    • Implement a parser method of your custom parser class extending the BasicParser class.
    • Build the jar and deploy it in the extra-parser directory.
    • Restart Metron Rest service to pick up the new parser from your jar file.
    • Add your parser in the Metron Management UI by choosing your parser type.
    • Configure and start your parser.
  • Stop your interim Grok parser and start your custom Java parser.

Apache Metron as an Example for a Real Time Data Processing Pipeline

In my previous blog post I was writing a little bit about what Apache Metron is and How to Onboard a New Data Source in Apache Metron.

Now I want to shine some light on how the ingestion pipeline architecture looks like. Since I just got started with Apache Metron myself, I hope this helps to kickstart your cyber security efforts. Rather than going too much into the details of what the components do, I’d like to provide a basic overview about which components there are.

This architecture can be generalized for all kinds of streaming use cases. The pipeline uses Apache NiFi for ingest, Apache Kafka as an event buffer, Apache Storm for stream processing, Apache Hadoop for long term storage and Apache Solr for short term random access storage. If you design your own pipeline for a different use case, you can, e.g., swap Apache Storm with frameworks such as Apache Flink or Spark Streaming (or any other frameworks out in the wild with their pros and cons). Choosing the right piece of technology strongly depends on numerous factors, I’m not going into in this article.

metron_pipeline

End to End Processing Pipeline for Apache Metron

Ingest

The most important part for Apache Metron is to get the telemetry data into an Apache Kafka topic. In the figure below you can see that there is a Kafka topic and a corresponding parser for each format. Usually, there is one Kafka topic per source type, because each source typically comes in its own special format, but it’s also possible that data of one source has multiple formats or multiple sources have the same format.

metron_pipeline_ingest_closeup.png

  • Apache NiFi is being used as the data integration tool.
  • In the figure, I added an example of a MiNiFi instance to the Squid Access Log source. In this case MiNiFi is installed on the Squid server node and acts as a log forwarder.
  • It’s also possible that sources write directly into Kafka, if they support that. In some cases this might even be a requirement due to performance constraints.

Parsing

As described in the ingest part: there is a topic for each parser format and an Apache Storm topology reading from this Kafka topic and doing the parsing. A parsed event is then written into the so-called “enrichments” topic.

metron_pipeline_parsing_closeup

  • The parsing has two purposes:
    • it brings all ingest format into a JSON format.
    • it introduces a common set of fields shared among all data sources, as well as unique fields that are special to each source.
  • Some parsers of common formats are included in the Metron project.
  • If there is no parser (that works) for your format, you can use Grok to quickly prototype and launch your parser before you write it in Java.
  • It is also possible to launch parser chains to extract information that is convoluted in different formats.
  • You can also decide to run only one topology handling multiple parsers in a so-called aggregated parser. This can be combined with parser chains.

Enrichment

The purpose of the enrichment Storm topology is to pick up events from the enrichments topic and add information from external sources. The enriched output is written to an indexing topic.

metron_pipeline_enrichment_closeup

  • A typical enrichment is a lookup in a database to convert an IP address into geo information
  • The profiler uses sliding windows to create aggregates/statistics in certain time windows, so-called profiles.
  • These profiles can be used to enrich data.
  • Metron helps you use any data in HBase to enrich your events.

Persisting

There are two Storm topologies to read from the indexing topic that persist events, the batch indexing topology and the random access indexing topology. The first utilizes an HDFSBolt to write data to HDFS. The latter one indexes data in Apache Solr.

metron_pipeline_persisting_closeup

  • There is one Solr collection per data format.
    • This way the parsed fields and definitions are kept clean and separated.
    • Also, you can authorize different users and groups to different data sources. This is even easier with the Solr Plugin for Apache Ranger.
  • HDFS is used as long term storage for analytical purposes and to use the data to create machine learning models.
  • Solr is being used for direct fast random access and search capabilities, e.g. by the Metron Alerts UI. It makes sense to store the data for only a limited amount of time for performance reasons.
  • It’s quite easy to create a new collection. I’ve described it on this github gist. I’ve added properties in the solrconfig.xml to define a “time to live” for an event in Solr, after which the event will be deleted from the collection.
  • Instead of Solr, you can use Elastic Search.

Conclusion

I hope this can be useful for somebody, either trying to implement Metron or somebody interested in how modern streaming pipelines look like in general. If you have questions, don’t hesitate to ask the experts in the Metron mailing list (user@metron.apache.org) or get support from the Hortonworks Community.

How to Onboard a New Data Source in Apache Metron

Introduction

Apache Metron aims to be a tool for analysts in a cyber security team to help them defining intelligent alerts, detecting threats and work on them in real-time. This is the first blog post in a row to ease operations and share my experiences with Apache Metron. Thus, it serves as an introduction to Metron.

logo

Technical Introduction

Apache Metron is a cyber security platform making heavy use of the Hadoop Ecosystem to create a scalable and available solution. It utilizes Apache Storm and Apache Kafka to parse, enrich, profile, and eventually index data from telemetry sources, such as network traffic, firewall logs, or application logs in real-time. Apache Solr or Elastic Search are used for random access searches, while Apache Hadoop HDFS is used for long term and analytical storage. It comes with its own scripting language “Stellar” to query, transform and enrich data. A security operator/analyst uses the Metron Management UI to configure and manage input sources as well the Metron Alerts UI to search, filter and group events.

Screen Shot 2018-07-18 at 08.07.45

Metron Alerts UI, showing a few dummy events from a Squid log.

Scope of this Post

Since virtually every data source can be used to generate events, it is natural that the platform operator/analyst wants to add data from new sources over time. I use this post as a small check list, to document considerations for the “onboarding” process of new data sources. You might want to automate this process in a way that works for you. In future posts I will cover the steps in detail.

Onboard a New Data Source

I need to ingest data to Kafka

  • It’s very handy to use Apache NiFi for the ingest part. Just create a data flow consisting of two processors: a simple tcp listener to receive data and a Kafka producer to push the event further into Kafka.
  • I can also push data directly into Kafka if the architecture,  firewall and the source system allow it.
  • If there are no active components on the source system pushing data, I might want to install an instance of MiNiFi on my source system.
Screen Shot 2018-07-19 at 10.32.29

Simple example of a data ingest into Kafka via NiFi

Before I can ingest data into Kafka, I need a new Kafka topic

  • While the Kafka topics “enrichments” and “indexing” Kafka topics will be used by all data sources, the parser topics are specific to a data source.
  • I create a topic named “squid” with a number of partitions that corresponds to the amounts of data I receive.

To make the events searchable, i.e., to store the events into Apache Solr, I need to create a new Solr collection

  • For each parser Storm topology and parser Kafka topic, there is a parser Solr collection.
  • I add a few specific fields common to all Metron Solr collections and optionally define data source specific fields in the schema.xml.
  • I create a new collection named “squid” with a number of shards that corresponds to the amount of data I receive.

I define my parser in the Metron Management UI

  • I click the “+” button in the right bottom corner of the Metron Management UI.
  • I configure my parser by choosing a Java class and/or define a Grok pattern, insert a sample and check if the parsed output is what I expect.
  • I configure the parser: Kafka topic name, Solr collection name, parser config, enrichment defintions, threat intel logic, transformations, parallelism.
  • I save the parser configuration and press the “Play” button next to the new parser to start it.
Screen Shot 2018-07-18 at 08.37.04 1

Metron Management UI with my configured parsers. Currently only the Squid parser is running that produces the events in the first screenshot.

 

Outlook

I hope this post was helpful and informative. For questions I refer to the documentation, future posts, the Metron mailing list or post a question below.

 

How to Troubleshoot an Apache Storm Topology

Apache Storm is a real-time, fault-tolerant, event-based streaming framework and platform that runs your code in a highly parallelized way on distributed nodes. It’s all about Spouts (processing units to read from data sources) and Bolts (general processing units). Storm is often used to read data from Apache Kafka and write the results back to Kafka or to a data store. Apache Storm and Apache Kafka are the work horses of the cyber security platform Apache Metron. Storm is also being used internally by the Streaming Analytics Manager (SAM)

This article guides you through the debugging process and points you to the places you need to tweak your configuration to get your topology up and running in a kerberized environment in case certain errors occur. For basic information on how to authenticate your application check out the reference implementation by Pierre Villard on his Github page.

Storm_logo

Prerequisites

I assume that you start from a certain point:

  • Your Storm cluster and the services you communicate with (Kafka, Zookeeper, HBase) is up and running as well as secure, i.e., the authentication happens through the Kerberos protocol.
  • Your Storm cluster is configured to run topologies as the OS user corresponding to the Kerberos principal who submitted the topology. (See: “Run worker processes as user who submitted the topology” in the excellent article of the Storm documentation)
  • Your topology (written in Java) is ready to be deployed and authentication is put in place.

Debugging Process

  • Use the Storm UI to check if the topology’s workers are throwing any errors and on which machine they are running! The worker’s log files are stored on the machine the worker is running in  /var/log/storm/workers-artifacts/<topology-name><unique-id>/<port-number>/worker.log.
  • Check the input data and output data of your Storm topology. In case you are using Kafka, connect via the Kafka console consumer and read from the input and the output topic of your topology! If you don’t see any events in the input Kafka topic, you should check upstream for errors. If you do see input events, but no output events, refer to your topology logs described in the item above. If you do see output events, check if they have the expected format (data format, number and kind of fields are correct, fields contain data that makes sense as opposed to null values)

Screen Shot 2018-07-16 at 19.49.37

# List Kafka topics:
bin/kafka-topics.sh --zookeeper <zookeeper.hostname>:<zookeeper.port> --list

# print messages as they are written on stdout from input topic
bin/kafka-console-consumer.sh --bootstrap-server <kafka.broker.hostname>:<kafka.broker.port> --topic input

# print messages as they are written on stdout from output topic
bin/kafka-console-consumer.sh --bootstrap-server <kafka.broker.hostname>:<kafka.broker.port> --topic output

Possible Error Scenarios

Authentication Errors Exception

Caused by: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user

Your topology is being submitted and the supervisor tries to start and initialize the Spouts and Bolts in the worker process based on the configuration you provided. When this error occurs the worker process is killed and the supervisor tries to spawn a new worker process. On the machine the worker is supposed to run, you can see a worker process popping up with a certain PID (ps aux | grep <topology_name>). A few seconds later this process is killed and a process exactly as the old one is started with a different PID. You can also tail the worker log and see this error message. Soon afterwards the “Worker has died” message appears. This can happen for various reasons:

  • The OS user running the topology does not have the permission to read the keytabs configured in the jaas config file. Check with ps aux or top which user is running and check if the keytab has the correct POSIX attributes. Usually it should be read-only by the owning user (-r– — — <topology-user><topology-user>)
  • The jaas configuration points to the wrong keytabs to be used for authentication and the OS user does not have permission to those. Check with ps aux which jaas file is configured. You might find an option there. Check if this jaas config file has the desired authentication options configured. If not configure your own and pass it to the topology.
-Djava.security.auth.login.config=/etc/storm/conf/client_jaas.conf

Basics of Hadoop User Management

Hadoop is old, everyone has their own Hadoop cluster and everyone knows how to use it. It’s 2018, right? This article is just a collection of a few gotchas, dos and don’ts with respect to User Management that shouldn’t happen in 2018 anymore.

Terminology

Just a few terms and definitions so that everyone is on the same page for the rest of the article. Roll your eyes and skip that section if you are an advanced user.

  • OS user = user that is provisioned on the operating system level of the nodes of a Hadoop cluster. You can check if the user exists on OS level by doing

id <username>

  • KDC = Key Distribution Center. This might be a standalone KDC implementation, such as the MIT KDC or an integrated one behind a Microsoft Active Directory.
  • Keytab = file that stores the encrypted password of a user provisioned in a KDC. Can be used to authenticate without the need of typing the password using the “kinit” command line tool.

Do

  • Make sure your users are available on all nodes of the OS, as well as in the KDC. This is important for several reasons:
    • When you run a job, the job might create staging/temporary directories in the /tmp/ directory, which are owned by the user running the job. The name of the directory is the name of the OS user, while the ownership belongs to authenticated user. In a secure cluster the authenticated user is the user you obtained a Kerberos ticket for from the KDC.
    • Keytabs on OS level should be only readable by the user OS user who is supposed to authenticate with them for security purposes.
    • When impersonation is turned on for services, e.g., Oozie using the -doas tag, Hive using the property hive.server2.enable.doAs=True property or Storm using the supervisor.run.worker.as.user=true  property, a user authenticated as a principal will run on the OS level as a processs owned by that user. If that user is not known to the OS, the job will fail (to start).

Don’t

  • Don’t use the hdfs user to run jobs on YARN (it’s forbidden by default and don’t change that configuration). Your problem can be solved in a different way! Only use the hdfs user for administrative tasks on the command line.
  • Don’t run Hive jobs as the “hive” user. The “hive” user is the administrative user and if at all should only be used by the Hadoop/database administrator.
  • Or in general: Don’t use the <service name> user to do  <operation> on <service name>. You saw that coming, hm?

How to Achieve Synchronisation of KDC and OS Level

(…or other user/group management systems). This is a tricky one, if you don’t want to run into a split brain situation, where one system knows one set of users and another one knows others, which may or may not overlap.

  • Automate user provisioning, e.g., by using an Ansible role that provisions a user in the KDC and on all nodes of the Hadoop cluster.
  • Use services such as SSSD (System Security Services Daemon) that integrates users and groups from user and group management services into the operating system. So you won’t need to actually add them to each node, as long as SSSD is up and running.
  • Manually create OS users on all nodes and in the KDC (don’t do that, obviously ;P )

 

Maybe I’ll expand that list in the future 🙂

4 Things Factorio Taught Me about DevOps

What is Factorio?

Factorio is a computer game. You probably ask yourself, in which ways a computer game is related to this blog? Well, not at all – or is it? Let’s find out.

maxresdefault

Basically, in the game you take over the role of a character in 3rd person perspective, whose rocket ship crashed on a foreign planet. You don’t have anything, but a pick axe in the beginning. Your goal is to repair your rocket ship to go home safely. In the course of the game you’ll go from picking resources manually and crafting items manually to building mines, creating items in factories and connecting them using conveyor belts. Later on you even built your own power plants, electric networks and trains. The game is not forcing you to automate everything, but if you didn’t you probably wouldn’t reach your goal in this lifetime. In the game’s “About” section you can find how this goal is supposed to be achieved:

You will be mining resources, researching technologies, building infrastructure, automating production and fighting enemies. Use your imagination to design your factory, combine simple elements into ingenious structures, apply management skills to keep it working and finally protect it from the creatures who don’t really like you.

Does that sound familiar to you? No? – Then read it again and think about it as if it were a very honest job description for an engineering role in a large corporation which just started to “explore” Agile and DevOps methodologies. This is when I started to think about analogies between the game and IT projects.

What the game taught me about IT automation

A lot of those points feel super logical just writing them down, and not even worth mentioning, but then again, why don’t companies start acting on them?

  • You can only succeed if you keep automating processes, automate those processes that save you the most time at the moment and keep automating what you automated and combine automated processes.
  • You can’t completely avoid doing and fixing things manually, nor should you. There’s always the aliens living on the planet* destroying your infrastructure and pipelines. In the beginning before you build repair robots, you need to do a lot of manual repair work. Also, keep in mind to not automate everything. Keep doing things manually that don’t actually justify the time you spend automating processes at that certain point of time. You may want to automate those processes later on, but focus on the important (= most repetitive and time-consuming) ones first.
    *Think about those aliens as an analogy to either software bugs or users or to colleagues who don’t want to move away from waterfall methodology. 😛
  • If you are not the only one working on a project you need to talk to each other and continuously update each other on your (changing) plans. You need to define interfaces and locations where your infrastructure and processes can interact with the infrastructure and processes of your co-workers.
  • You can only succeed if you tear down a working implementation to build something more stable and scalable, even if it means that you will invest lots of resources and time. That doesn’t mean that you shouldn’t have built that piece of infrastructure at all. And it also doesn’t mean that the new piece of infrastructure will be there forever.

I’m not sure myself anymore if I am talking about the game or real-life IT. Thus, it seems to be a working analogy. If you know both the game and real-life DevOps methodologies go ahead and post a comment if you can think about something that is missing in my list above.

This analogy is interesting enough to start checking out more analogies of the game. I’ll create one or two future blog posts on those analogies, such as multi-threading and databases.

 

How to Troubleshoot Apache Knox

Apache Knox is a gateway application and the door to access data in a Hadoop cluster hidden behind a firewall. While the usage is fairly simple the setup, configuration and debugging process can be tedious due to many different components that Apache Knox ties together. On Hortonworks Community Connection I wrote an article that shows you exactly what could be wrong with your Knox setup and how to efficiently track the error.

15403-17-05-08-security-concepts-knox

Commont Security Architecture Using Apache Knox for Data Access