The most surprising thing about querying petabyte-scale data in Dynatrace Grail is that you’re not "querying" in the traditional database sense. You’re instructing a massively parallel, distributed system to find and assemble the data you need, and it does so by scanning time-organized data in parallel and applying structure at read time rather than by consulting predefined per-field indexes.

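Let’s see it in action. Imagine you’re tracking user session data and want to find all sessions that hit a specific error code (ERR-123) and lasted longer than 5 minutes. In Grail, this isn’t a SELECT * FROM sessions WHERE error_code = 'ERR-123' AND duration > 300. Instead, you’d write a Dynatrace Query Language (DQL) pipeline like the one below. The session, isSessionStart, and isSessionEnd fields are assumed to exist on your log records; substitute whatever your own data uses.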

fetch logs
| filter loglevel == "ERROR" and contains(content, "ERR-123")
| summarize error_count = count(), by:{session}
| fields session
| join [
    fetch logs
    | filter isSessionStart == true
    | fields session, timestamp
  ], on:{left[session] == right[session]}, prefix:"session_start."
| join [
    fetch logs
    | filter isSessionEnd == true
    | fields session, timestamp
  ], on:{left[session] == right[session]}, prefix:"session_end."
| fieldsAdd duration = session_end.timestamp - session_start.timestamp
| filter duration > 5m
| fields session, duration

This query demonstrates how Grail works:

  1. fetch logs: This loads records from the logs data object. fetch reads one data object at a time, so metrics are not listed here; they are queried separately with the timeseries command. You think in terms of data types and sources, not tables.
  2. filter loglevel == "ERROR" and contains(content, "ERR-123"): This is the core of the filtering. In DQL, == is an exact match, so the substring test uses contains() instead of a * wildcard inside the quoted string. Grail evaluates the predicate in parallel across its distributed storage.
  3. summarize error_count = count(), by:{session}: Aggregating by session ID. Every row that comes out of this step has at least one matching error, so no separate count > 0 filter is needed.
  4. fields session: Selecting only the session IDs that meet the error criteria.
  5. join [...], on:{...}, prefix:"...": This is where the "assembly" happens. We’re joining the list of error sessions with separate subqueries for session start and end records. The prefix names the joined fields, for example session_start.timestamp. join is an inner join by default, so sessions without a start or end record drop out.
  6. fieldsAdd duration = session_end.timestamp - session_start.timestamp: Calculating the duration. Subtracting two timestamps yields a duration value.
  7. filter duration > 5m: Applying the duration filter, with 5m as a duration literal for five minutes.
  8. fields session, duration: The final output.

The problem Grail addresses is gaining insight from massive volumes of operational data (logs, metrics, traces, events) without the traditional pain points of schema management, complex ETL, or pre-aggregation for every possible query. Its architecture allows ad-hoc analysis on raw, immutable data.

Internally, Grail is a data lakehouse built for observability data. Dynatrace describes it as schema-on-read and index-free: records are stored with a timestamp and organized into buckets, and structure is applied when a query runs. A query therefore does not depend on per-field indexes defined up front. Instead, the query’s time range and any bucket restriction limit how much stored data has to be read, and a massively parallel processing engine scans what remains. The fetch command selects the data object and time range to read. The filter commands are evaluated during that scan, so placing them early reduces the data that flows into later stages. The join operations bring together the results of separate subqueries based on common identifiers.
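
As a minimal sketch of narrowing what a query has to read, the example below restricts both the time range and the bucket before any other work happens. It assumes the built-in default_logs bucket and reuses the ERR-123 error code from the earlier example; your environment may use custom bucket names.

```
// Read only the last two hours of a single bucket.
fetch logs, from:now() - 2h
| filter dt.system.bucket == "default_logs"
// Keep error records that mention the code, then count them per 5-minute bin.
| filter loglevel == "ERROR" and contains(content, "ERR-123")
| summarize errors = count(), by:{bin(timestamp, 5m)}
```

The two leading restrictions are the cheap ones: they cut the volume of data scanned before the content filter and aggregation run.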

The levers you control are mainly how you structure your queries and how you ingest data. On the ingest side, consistently recording identifiers such as session or traceId on every record is what makes it possible to correlate records later. On the query side, fetch accepts a time range, and you can restrict a query to specific buckets or log sources. For metrics, you specify the metric key and the dimensions to split by. The filter clauses are your primary tool for narrowing down the vast dataset, and the summarize, sort, and fields commands shape the output.
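
The two sketches below illustrate these levers under stated assumptions. The first targets one log source and caps the scan volume; the file path is hypothetical, and the 50 GB limit is an arbitrary illustrative value.

```
// Logs: bound the time range and the amount of data scanned,
// then narrow to a single (hypothetical) log source.
fetch logs, from:now() - 30m, scanLimitGBytes:50
| filter log.source == "/var/log/app/checkout.log"
| fields timestamp, loglevel, content
| sort timestamp desc
| limit 20
```

The second queries a metric by key and splits it by a dimension, assuming the built-in host CPU usage metric is available in your environment.

```
// Metrics: pick the metric key and the dimension to split by.
timeseries cpu = avg(dt.host.cpu.usage), by:{dt.entity.host}, from:now() - 1h
```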

The real magic, and what most people miss, is how Grail handles data evolution. Because stored records are immutable and structure is applied at read time, you can query data that was ingested before a certain field was consistently logged, or after a new field was introduced, and the query simply returns values where they exist and null where they do not. There’s no need for schema migrations in the traditional sense for historical data; fields can also be extracted from raw content at query time.
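
A hedged sketch of what this looks like in practice: the query below extracts an error code from raw log content at query time and tolerates records that predate a newer field. The tenant_id field is hypothetical, and the parse pattern assumes the code appears in the content as ERR- followed by digits.

```
fetch logs, from:now() - 7d
| filter contains(content, "ERR-")
// Extract the numeric part of the code at read time; no ingest-side schema needed.
| parse content, "LD 'ERR-' INT:error_code"
// Older records without tenant_id yield null, so substitute a placeholder.
| fieldsAdd tenant = coalesce(tenant_id, "unknown")
| summarize occurrences = count(), by:{error_code, tenant}
```

Records where the pattern does not match keep a null error_code instead of failing the query, which is the same behavior that lets old and new record shapes coexist.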

The next concept you’ll likely dive into is optimizing query performance by understanding how buckets, time ranges, and scan limits determine how much data a Grail query reads.
