Grok and regex processors aren’t just for pulling fields out of logs; they’re your primary tools for teaching your APM system to understand the unstructured noise of your applications.

Here’s a sample of what a real-time log stream might look like, right before we apply any parsing:

2023-10-27 10:30:01 INFO [main] com.example.MyApp: User 'alice' logged in from 192.168.1.100. Session ID: xyz789abc
2023-10-27 10:30:02 WARN [pool-2-thread-5] com.example.AuthService: Failed login attempt for user 'bob' from 10.0.0.5. Reason: Invalid password.
2023-10-27 10:30:03 INFO [main] com.example.MyApp: Processing order 12345 for user 'alice'.

We want to extract meaningful fields like level, thread, logger, user, ip_address, session_id, and message.

The default approach to APM log ingestion often treats each log line as a single, opaque string. This is like receiving a letter written in a foreign language without a translator. You can see the ink and paper, but you can’t understand the content. APM systems, by default, will store these lines, but querying for specific events, like "all failed login attempts" or "all requests by user 'alice'," becomes a brute-force string search across the entire log corpus. This is inefficient, slow, and error-prone, especially at scale. Grok and regex processors are the translators that enable structured querying.

Let’s start with Grok, which is built on top of regex but provides pre-defined patterns for common log formats.

Grok Pattern:

%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{JAVACLASS:logger}: %{GREEDYDATA:message}

Explanation:

  • %{TIMESTAMP_ISO8601:timestamp}: Matches ISO8601 timestamps (e.g., 2023-10-27T10:30:01Z or 2023-10-27 10:30:01) and names the extracted field timestamp.
  • %{LOGLEVEL:level}: Matches common log levels (INFO, WARN, ERROR, DEBUG, etc.) and names it level.
  • \[%{DATA:thread}\]: Matches text within literal square brackets, assuming it’s a thread name, and names it thread. DATA is a non-greedy pattern (.*?) that matches as few characters as possible, so it stops at the first closing bracket.
  • %{JAVACLASS:logger}: Matches Java class names (e.g., com.example.MyApp) and names it logger.
  • %{GREEDYDATA:message}: Matches the rest of the line and names it message. This is the catch-all for the main log content.
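
To see what the pattern does mechanically, here is a minimal Python sketch that hand-translates it into a single regular expression. Python has no built-in Grok support, so the sub-patterns below are simplified stand-ins for the real Grok definitions (an assumption for illustration, not the library's actual patterns), and parse_line is a hypothetical helper name.

```python
import re

# Simplified stand-ins for the Grok patterns used above. The real Grok
# definitions are more permissive; these are assumptions for illustration.
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})"  # ~TIMESTAMP_ISO8601
    r" (?P<level>[A-Z]+)"                                      # ~LOGLEVEL
    r" \[(?P<thread>.*?)\]"                                    # ~DATA (non-greedy)
    r" (?P<logger>[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*)"    # ~JAVACLASS
    r": (?P<message>.*)"                                       # ~GREEDYDATA
)


def parse_line(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None
```

Returning None for a non-matching line, rather than raising, mirrors how a pipeline would typically tag or skip a line it cannot parse instead of dropping the whole batch.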

Applying this Grok pattern to our sample logs would yield:

{
  "timestamp": "2023-10-27 10:30:01",
  "level": "INFO",
  "thread": "main",
  "logger": "com.example.MyApp",
  "message": "User 'alice' logged in from 192.168.1.100. Session ID: xyz789abc"
}

Notice that the message field still contains a lot of useful information we want to extract further. This is where nested parsing or additional regex processors come in.

Regex Processor for Nested Fields:

To extract user, ip_address, and session_id from the message field of the first log line, we can use a regex processor.

Regex Pattern:

User '(?P<user>\w+)' logged in from (?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?:.*Session ID: (?P<session_id>\w+))?

Explanation:

  • User '(?P<user>\w+)': Matches the literal "User '" followed by one or more word characters (\w+) captured into a group named user, and then a closing quote. The (?P<name>...) named-group syntax is the Python/PCRE form; engines such as Java or Oniguruma write it as (?<name>...).
  • logged in from : Matches the literal string.
  • (?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}): Captures a dotted-quad IPv4 address into a group named ip_address. \d{1,3} matches one to three digits, and \. matches a literal dot. It checks the shape only, not the range, so 999.999.999.999 would also match.
  • (?:.*Session ID: (?P<session_id>\w+))?: This is an optional, non-capturing group ((?:...)) that skips any intervening text (.*), matches "Session ID: ", and captures the following word characters (\w+) into a group named session_id. The ? at the end makes the entire session ID part optional, so logs without it still parse.
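
As a quick way to check the pattern outside a pipeline, the following Python sketch applies it to a message string. The function name extract_login_fields is hypothetical, and dropping unmatched optional groups is a design choice of this sketch rather than something every APM processor does.

```python
import re

# The login pattern from above, split across lines for readability.
LOGIN_RE = re.compile(
    r"User '(?P<user>\w+)' logged in from "
    r"(?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
    r"(?:.*Session ID: (?P<session_id>\w+))?"
)


def extract_login_fields(message):
    """Return the named groups that matched, or an empty dict on no match."""
    match = LOGIN_RE.search(message)
    if match is None:
        return {}
    # Optional groups that did not participate come back as None; drop them
    # so a missing session ID does not create an empty field.
    return {k: v for k, v in match.groupdict().items() if v is not None}
```

A message without a session ID still yields user and ip_address, which is the behavior the trailing ? is there to provide.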

After applying this regex to the message field of the first log line, the result would be:

{
  "timestamp": "2023-10-27 10:30:01",
  "level": "INFO",
  "thread": "main",
  "logger": "com.example.MyApp",
  "message": "User 'alice' logged in from 192.168.1.100. Session ID: xyz789abc",
  "user": "alice",
  "ip_address": "192.168.1.100",
  "session_id": "xyz789abc"
}

The original message field often remains for context, but you now have the extracted, queryable fields.

For the second log line, a different regex might be needed to capture the "Failed login attempt" details:

Regex Pattern for Failed Login:

Failed login attempt for user '(?P<user>\w+)' from (?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\. Reason: (?P<reason>.+)

Applying this to the second log line’s message field would extract:

{
  "timestamp": "2023-10-27 10:30:02",
  "level": "WARN",
  "thread": "pool-2-thread-5",
  "logger": "com.example.AuthService",
  "message": "Failed login attempt for user 'bob' from 10.0.0.5. Reason: Invalid password.",
  "user": "bob",
  "ip_address": "10.0.0.5",
  "reason": "Invalid password."
}
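
The same check can be run for the failed-login pattern. This is a minimal Python sketch; the variable names are illustrative only.

```python
import re

# The failed-login pattern from above, split across lines for readability.
FAILED_LOGIN_RE = re.compile(
    r"Failed login attempt for user '(?P<user>\w+)' from "
    r"(?P<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\. "
    r"Reason: (?P<reason>.+)"
)

message = (
    "Failed login attempt for user 'bob' from 10.0.0.5. "
    "Reason: Invalid password."
)

# search() returns a match object here because the message fits the pattern.
fields = FAILED_LOGIN_RE.search(message).groupdict()
```

Because reason is captured with .+, it runs to the end of the message and keeps the trailing period, which matches the extracted value shown above.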

The power of this approach is that you can chain these processors. A common pattern is to use a broad Grok pattern first to get the basic structure, and then apply specific regex processors conditionally or sequentially to the message field to extract finer-grained details based on the content.
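
A rough Python sketch of that chain might look like the following. It assumes a simplified base pattern in place of the Grok expression and tries the message-level patterns in order, stopping at the first match; process and MESSAGE_PATTERNS are hypothetical names, not part of any APM product.

```python
import re

# Stage 1: a simplified stand-in for the broad Grok pattern (an assumption).
BASE_RE = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) \[(?P<thread>.*?)\] "
    r"(?P<logger>[\w.$]+): (?P<message>.*)"
)

# Stage 2: content-specific patterns, tried in order against the message field.
IPV4 = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
MESSAGE_PATTERNS = [
    re.compile(
        rf"User '(?P<user>\w+)' logged in from (?P<ip_address>{IPV4})"
        r"(?:.*Session ID: (?P<session_id>\w+))?"
    ),
    re.compile(
        rf"Failed login attempt for user '(?P<user>\w+)' from "
        rf"(?P<ip_address>{IPV4})\. Reason: (?P<reason>.+)"
    ),
]


def process(line):
    """Parse the common structure, then enrich from the first matching pattern."""
    base = BASE_RE.match(line)
    if base is None:
        return {"message": line}  # keep unparsed lines intact
    doc = base.groupdict()
    for pattern in MESSAGE_PATTERNS:
        match = pattern.search(doc["message"])
        if match:
            doc.update(
                {k: v for k, v in match.groupdict().items() if v is not None}
            )
            break  # first matching pattern wins
    return doc
```

Lines whose message matches none of the specific patterns, such as the "Processing order" line, still come out with the base fields, so nothing is lost when a new message type appears.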

The detail that most often trips people up is which input each processor reads. Processors in a pipeline run in sequence, but a processor does not automatically consume whatever the previous one produced: it reads the field it is configured to read. So the usual arrangement is to parse out the common elements first (timestamp, level, thread, logger), and then point a second processor at the message field, or at any other field you extracted earlier. Many APM systems let you specify the source field for each processor; when it is left unset, some default to the raw log line, so check your system's documentation rather than assuming.
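
To make the source-field point concrete, here is a small Python sketch of a processor that takes the field to read as an explicit argument. The function regex_processor, the raw field name, and the order_id field are all hypothetical and chosen for illustration.

```python
import re


def regex_processor(doc, field, pattern):
    """Apply pattern to doc[field]; merge named groups into a copy of doc.

    The document is returned unchanged when the field is missing or the
    pattern does not match.
    """
    value = doc.get(field)
    if value is None:
        return doc
    match = re.search(pattern, value)
    if match is None:
        return doc
    out = dict(doc)
    out.update({k: v for k, v in match.groupdict().items() if v is not None})
    return out


BASE = r"^\S+ \S+ (?P<level>[A-Z]+) \[(?P<thread>[^\]]*)\] \S+: (?P<message>.*)$"
ORDER = r"^Processing order (?P<order_id>\d+) for user '(?P<user>\w+)'"

raw = (
    "2023-10-27 10:30:03 INFO [main] com.example.MyApp: "
    "Processing order 12345 for user 'alice'."
)

# Step 1 reads the raw line; step 2 is pointed at the extracted message field.
doc = regex_processor({"raw": raw}, "raw", BASE)
doc = regex_processor(doc, "message", ORDER)

# Pointing the anchored ORDER pattern at the raw line instead finds nothing,
# because the raw line starts with the timestamp, not with "Processing".
wrong = regex_processor({"raw": raw}, "raw", ORDER)
```

The last call shows the failure mode: the same pattern silently extracts nothing when it is aimed at the wrong field, which is why the source field is worth setting explicitly.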

The next concept you’ll encounter is how to handle dynamic field names or nested structures that change frequently, often requiring more advanced scripting processors or custom patterns.
