Gravwell 5.4.0 introduces alerts, a new feature that makes it easier to take automated action based on the results of scheduled searches. Users define scheduled searches whose results they want to act on; when those searches generate events, one or more flows are executed to process them. In this post we'll discuss exactly what makes up an alert, then take a look at a real alert we've been using on Gravwell infrastructure.
Since introducing flows, we've seen a lot of users repeating the same basic pattern: a flow runs a query, checks whether it returned any results, and if it did, takes some action such as sending an email, creating a ticket, or posting to Slack. This works fine, but it means you end up with dozens of flows that all look more or less identical, and if you want to change something like the recipient of an email, you have to walk through each flow to update it!
Alerts formalize this pattern in a way that avoids needless duplication. An alert links one or more scheduled searches to one or more flows. When any of the linked searches runs and returns results, those results become alert events: they are ingested to a user-selected tag along with some metadata, and then every consumer is run once per event.
These concepts become much more obvious when you look at a real alert, so let's do that, but first, here's some basic terminology you can reference as we go along:

Dispatcher: a scheduled search attached to the alert. When a dispatcher returns results, those results become alert events.

Consumer: a flow attached to the alert. Each consumer is run once for every alert event.

Event: a single result from a dispatcher, ingested to the alert's tag along with metadata about the alert that generated it.

Validation: an optional list of fields which every dispatcher's results are expected to contain.
Now let's take a look at a real alert. We use this alert to monitor some of our company infrastructure at Gravwell:
There's a lot going on, so let's break it down. First off, there's the set of Dispatchers on the left:
These are three scheduled searches which detect brute-force login attempts against our Gravwell cluster, failed logins to our AWS accounts, and logins to our AWS accounts that originated outside the US. Here's the query which detects brute-force logins:
tag=gravwell syslog Message == "Authentication failure" Structured[gw@1].host Structured[gw@1].user Appname == webserver Hostname
| stats count by user Hostname host
| eval (count > 3)
| geoip host.CountryName
| printf -e message "Potential brute force login attempt on %s for user %s from %s (%s). %d attempts" Hostname user host CountryName count
| enrich critical false level low
| alias TIMESTAMP timestamp
| table -nt TIMESTAMP level critical message
Note how the query builds up a user-friendly message field, and uses the enrich module to set critical=false and level=low. Those three fields (message, level, and critical) correspond to the fields defined in the Validation section of the alert definition:
Validation is an optional (but highly recommended) way to ensure that all of your dispatchers produce similar-looking results. All three dispatcher queries have fields named message, level, and critical in their outputs. If one didn't, there would be a warning message next to it in the dispatcher column.
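To see why that matters, here's a rough sketch of what another dispatcher might look like. This isn't our actual AWS query; the aws tag and the userIdentity.userName and sourceIPAddress extractions are placeholders. The point is the shape of the output: whatever the source data looks like, the query normalizes it into the same message, level, and critical fields so it passes validation:

tag=aws json userIdentity.userName as user sourceIPAddress
| geoip sourceIPAddress.CountryName
| printf -e message "Failed AWS login for %s from %s (%s)" user sourceIPAddress CountryName
| enrich critical false level low
| table message level critical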
Because all three dispatchers produce similar-looking results, we can use the same simple flow to process events from any of them. In this alert we've defined only one Consumer, but we could add another, for example one which sends an email; any time events are generated, both consumers would be executed.
The flow itself is very simple: it uses the Text Template node to build up a message, then sends it to our internal chat server with the Mattermost node:
Note how the template references .event.Contents.critical, .event.Contents.message, and .event.Contents.level; these are the three fields we guaranteed each dispatcher would include in its events.
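As a rough sketch of what such a template might contain (this isn't our exact template, and the wording is purely illustrative), something along these lines would produce a readable chat message from those three fields:

[{{ .event.Contents.level }}] {{ .event.Contents.message }} (critical: {{ .event.Contents.critical }})

Because validation guarantees every dispatcher populates those fields, this single template works no matter which search generated the event.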
Finally, note that the alert is configured to ingest events into the _alerts tag. It will only process up to 16 results from any given execution of the dispatcher; this is intended to avoid a flood of notifications if something unexpected occurs.
Let's see what happens when we trigger the alert by attempting a "brute force" login against the Gravwell system, purposefully failing to log in as "admin" several times.
Our first indication that the alert actually triggered comes from a message on the internal Mattermost channel:
If we look in the tag specified on the alert definition, we can find the event that was ingested; note that Gravwell ingests not only the results of the query but also metadata about which alert was triggered, which dispatcher triggered it, and which consumers were run.
If the search which looks for failed AWS logins fired instead, the same flow would execute and post a similar message to our Mattermost channel; a similar-looking event would be ingested into the same tag.
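Since alert events live in an ordinary tag, you can also go back and search them later. Here's a rough sketch of such a query; it assumes the ingested JSON nests the dispatcher's fields under Contents, mirroring the .event.Contents references in the template, so check a real event in your own environment for the exact layout:

tag=_alerts json Contents.message as message Contents.level as level
| table message level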