Datadog’s log processing pipelines can enrich your logs with custom attributes before they reach the indexing layer, which makes queries faster and more precise and can help keep storage costs under control.
Let’s see this in action. Imagine you’re getting web server logs from Nginx, and you want to add the HTTP status code as a distinct attribute for easier filtering.
2023-10-27T10:00:00Z INFO webserver.access: 192.168.1.10 - - [27/Oct/2023:10:00:00 +0000] "GET /api/users HTTP/1.1" 200 150 "-" "curl/7.68.0"
Without enrichment, you’d be searching through the raw message field. With enrichment, we can pull out that 200 and make it a first-class citizen.
Here’s how you set up a pipeline in Datadog:
- Navigate to Logs -> Configuration -> Pipelines.
- Click "New Pipeline".
- Name it something descriptive, like "Web Server Enrichment".
- Click "Add Processor".
The first processor we’ll add is a "Grok" processor. It uses pattern rules built on regular expressions to parse unstructured log data into structured attributes.
Processor Type: Grok
Grok Pattern: %{IPORHOST:clientip} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:status:int} %{NUMBER:bytes:int} \"%{DATA:referrer}\" \"%{DATA:useragent}\"
Activated: On
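This pattern breaks down the access-log portion of the Nginx line. The key part for us is %{NUMBER:status:int}, which captures the numeric status code, labels it status, and converts it to an integer. The pattern is shown in Logstash-style Grok notation. Datadog’s Grok Parser uses its own matcher names (for example ipOrHost, word, and integer) and expects each rule to begin with a rule name, so adapt the pattern when you enter it in the UI.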
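To make the extraction concrete, here is a minimal Python sketch that approximates the Grok pattern with a plain regular expression. It is not how Datadog implements Grok. The regex, the function name parse_access_log, and the choice to stop at the byte count are assumptions made for illustration. The group names mirror the attribute names used in this article.

```python
import re

# Rough regex stand-in for the Grok pattern (fields up to the byte count).
# This only illustrates the idea; it is not Datadog's Grok engine.
ACCESS_LOG = re.compile(
    r'(?P<clientip>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\w+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<status>\d+) (?P<bytes>\d+)'
)


def parse_access_log(line):
    """Return the extracted attributes, or None if the line does not match."""
    match = ACCESS_LOG.search(line)
    if match is None:
        return None
    attrs = match.groupdict()
    # Mirror the ":int" conversions from the Grok pattern.
    attrs["status"] = int(attrs["status"])
    attrs["bytes"] = int(attrs["bytes"])
    return attrs


line = (
    '2023-10-27T10:00:00Z INFO webserver.access: 192.168.1.10 - - '
    '[27/Oct/2023:10:00:00 +0000] "GET /api/users HTTP/1.1" 200 150 '
    '"-" "curl/7.68.0"'
)
attrs = parse_access_log(line)
print(attrs["status"], attrs["verb"], attrs["request"])  # 200 GET /api/users
```

Because status comes back as an integer rather than a substring of the message, later steps can compare it numerically.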
Now, let’s add another processor to use that status attribute.
Processor Type: Attribute
Action: Add
Name: http.status_code
Value: {{status}}
Activated: On
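This "Attribute" processor takes the value that the Grok processor extracted into the status attribute and assigns it to a new attribute called http.status_code. The {{status}} syntax tells Datadog to substitute the value of the status attribute. In Datadog’s processor catalog, the Remapper processor covers the same case by mapping a source attribute such as status to a target attribute such as http.status_code.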
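Conceptually, this step is just a copy from one key to another. The sketch below models a parsed log as a Python dict. The helper add_attribute is hypothetical, and storing http.status_code as a flat key is a simplifying assumption rather than a description of how Datadog stores attributes.

```python
def add_attribute(log, name, source):
    """Return a copy of `log` with log[source] also stored under `name`.

    Hypothetical helper for illustration; the source attribute is preserved.
    """
    if source not in log:
        return log  # nothing to copy, leave the log untouched
    enriched = dict(log)
    enriched[name] = log[source]
    return enriched


parsed = {"clientip": "192.168.1.10", "verb": "GET", "status": 200}
enriched = add_attribute(parsed, "http.status_code", "status")
print(enriched["http.status_code"])  # 200
```

Returning a copy instead of mutating the input keeps each step easy to reason about when several processors run in sequence.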
Why this works:
The Grok processor acts as a parser, dissecting the raw log message into key-value pairs based on predefined patterns. The Attribute processor then takes one of those extracted values (status) and creates a new, distinct attribute (http.status_code) that Datadog can index and query independently. This means you can search for http.status_code:200 directly, instead of message:* 200 *.
You can chain multiple processors. For example, you might add a "URL Parser" processor after the Grok processor to extract path components from the request field, or a "User Agent Parser" to break down the useragent string.
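Chaining can be pictured as passing the log through a list of functions, each one reading what the previous one produced. The following is a sketch under that assumption. The names run_pipeline, copy_status, and parse_url, and the url.path_segments key, are made up for illustration and do not correspond to Datadog APIs.

```python
from urllib.parse import urlsplit


def copy_status(log):
    """Stand-in for the attribute step: expose status as http.status_code."""
    out = dict(log)
    out["http.status_code"] = log["status"]
    return out


def parse_url(log):
    """Stand-in for a URL parsing step: split the request path into segments."""
    path = urlsplit(log["request"]).path
    out = dict(log)
    out["url.path_segments"] = [segment for segment in path.split("/") if segment]
    return out


def run_pipeline(log, processors):
    """Apply each processor in order, feeding its output to the next one."""
    for processor in processors:
        log = processor(log)
    return log


parsed = {"status": 201, "verb": "POST", "request": "/api/v2/orders"}
result = run_pipeline(parsed, [copy_status, parse_url])
print(result["url.path_segments"])  # ['api', 'v2', 'orders']
```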
Consider this log line:
2023-10-27T10:01:00Z INFO webserver.access: 192.168.1.11 - - [27/Oct/2023:10:01:00 +0000] "POST /api/v2/orders HTTP/1.1" 201 50 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
After the Grok processor, you’d have:
status: 201
clientip: 192.168.1.11
verb: POST
request: /api/v2/orders
useragent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
The Attribute processor then sets:
http.status_code: 201
This can also help with cost control. Once the fields you care about are dedicated attributes, searches can target them directly instead of relying on full-text matching against the entire raw message. If you have many logs with similar message structures, this can significantly reduce how much you depend on full-text search.
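The order of processors matters: they run in sequence, so the Grok processor must come before any processor that reads status. A "Conditional Processor" can be used to apply subsequent processors only when certain conditions are met, for example when status is 404.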
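The effect of ordering and conditions can be sketched in a few lines of Python. Everything here is hypothetical: the when wrapper, the error.kind attribute, and the simplified one-line message format are assumptions chosen to keep the example short, not Datadog features.

```python
def extract_status(log):
    """Toy parser: treat the last token of the message as the status code."""
    out = dict(log)
    out["status"] = int(log["message"].split()[-1])
    return out


def mark_not_found(log):
    """Add a hypothetical error.kind attribute."""
    out = dict(log)
    out["error.kind"] = "not_found"
    return out


def when(predicate, processor):
    """Wrap a processor so it only runs when predicate(log) is true."""
    def conditional(log):
        return processor(log) if predicate(log) else log
    return conditional


def run_pipeline(log, processors):
    for processor in processors:
        log = processor(log)
    return log


def is_404(log):
    return log.get("status") == 404


raw = {"message": "GET /missing 404"}
in_order = run_pipeline(raw, [extract_status, when(is_404, mark_not_found)])
wrong_order = run_pipeline(raw, [when(is_404, mark_not_found), extract_status])
print("error.kind" in in_order, "error.kind" in wrong_order)  # True False
```

In the second run the condition is evaluated before status exists, so the conditional step silently does nothing. That is the practical reason to keep parsing steps at the top of a pipeline.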
The most surprising thing about these pipelines is how granularly you can control the enrichment before indexing. You’re not just adding a tag; you’re creating structured fields that become first-class citizens for querying, aggregation, and alerting. This means you can filter logs by http.status_code:500 with the same efficiency as filtering by hostname, even if the status code was originally buried deep within a raw text message.
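As a rough illustration of what structured fields buy you, the sketch below filters and aggregates a handful of already-enriched logs in plain Python. The sample records and host names are invented for the example, and the code only mimics the kind of query a log platform runs over indexed attributes.

```python
from collections import Counter

# Invented sample data: logs that already carry the enriched attribute.
logs = [
    {"host": "web-1", "http.status_code": 200},
    {"host": "web-1", "http.status_code": 500},
    {"host": "web-2", "http.status_code": 200},
    {"host": "web-2", "http.status_code": 500},
    {"host": "web-2", "http.status_code": 404},
]

# Aggregation: count logs per status code.
by_status = Counter(log["http.status_code"] for log in logs)

# Filter plus aggregation: which hosts produced 500s, and how many.
errors_by_host = Counter(
    log["host"] for log in logs if log["http.status_code"] == 500
)

print(by_status[500], dict(errors_by_host))  # 2 {'web-1': 1, 'web-2': 1}
```

Neither query needs to inspect the message text, which is the point of extracting the status code up front.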
Once you’ve mastered attribute enrichment, the next logical step is learning how to use these enriched attributes to build sophisticated monitors and dashboards.