Cosmos DB’s Analytical Store is a high-throughput, low-cost storage layer that allows you to run analytical queries on your operational data without impacting your transactional workloads. When you query Cosmos DB using Azure Synapse Analytics, you’re leveraging the power of Synapse’s distributed query engine to process large datasets stored in the Analytical Store. This unlocks scenarios like business intelligence, machine learning, and data warehousing on your real-time data.
Here’s a look at how it works and what you need to know to get the most out of it.
The System in Action: A Synapse Query on Cosmos DB Analytical Store
Imagine you have a Cosmos DB container storing customer order data. Each order document might look something like this:
{
"id": "order-12345",
"customerId": "cust-abcde",
"orderDate": "2023-10-27T10:00:00Z",
"totalAmount": 150.75,
"items": [
{"productId": "prod-xyz", "quantity": 2, "price": 50.25},
{"productId": "prod-uvw", "quantity": 1, "price": 50.25}
],
"status": "Shipped"
}
You want to find the total revenue generated from "Shipped" orders in the last 30 days.
In Azure Synapse Analytics, you’d create a linked service to your Cosmos DB account. Then, you’d define an external table in Synapse that points to the Analytical Store of your Cosmos DB container.
-- Create a SQL pool in Synapse if you don't have one
CREATE DATABASE MySynapseDB;
USE MySynapseDB;
-- Create a Linked Service to your Cosmos DB account (this is done in Synapse Studio UI, not SQL script)
-- Assume the linked service is named 'CosmosDBLinkedService'
-- Create an external table pointing to the Cosmos DB Analytical Store
CREATE EXTERNAL TABLE dbo.CustomerOrders (
id VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
customerId VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS,
orderDate DATETIME2(7),
totalAmount DECIMAL(18, 2),
status VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
)
WITH (
DATA_SOURCE = 'CosmosDBLinkedService', -- Name of your linked service
LOCATION = 'your_cosmosdb_database/your_cosmosdb_container', -- Database and container name
FILE_FORMAT = 'PARQUET' -- Analytical Store uses Parquet format
);
Once the external table is defined, you can query it using standard T-SQL:
SELECT
SUM(totalAmount) AS TotalRevenue
FROM
dbo.CustomerOrders
WHERE
status = 'Shipped'
AND orderDate >= DATEADD(day, -30, GETDATE());
Synapse’s SQL pool will then efficiently query the Analytical Store, retrieving and processing the data without consuming RUs from your transactional store.
The Mental Model: Decoupled Processing
The core idea behind querying Cosmos DB’s Analytical Store with Synapse is decoupling analytical processing from transactional operations.
- Transactional Store: This is your primary Cosmos DB store, optimized for low-latency reads and writes, serving your live applications. It operates on a Request Unit (RU) model.
- Analytical Store: This is a separate, columnar storage format (based on Apache Parquet) that asynchronously ingests data from your transactional store. It’s designed for high-throughput, bulk analytical queries. Crucially, it has its own, much lower, cost structure and does not consume RUs from your transactional store.
- Azure Synapse Analytics: This is your analytical engine. When you connect Synapse to your Cosmos DB Analytical Store, Synapse reads data directly from the Parquet files in the Analytical Store. Synapse’s distributed query engine (SQL pool or Spark pool) then processes these files.
This separation means that even if you run extremely complex, long-running analytical queries, your application’s performance on the transactional store remains unaffected. You get the benefits of a data warehouse/data lake for analytics on your operational data without the typical trade-offs.
The Levers You Control
- Analytical Store Enablement: You must explicitly enable the Analytical Store for each Cosmos DB container you want to query analytically. This is a setting in the Cosmos DB portal or via API.
- Data Ingestion: By default, data flows into the Analytical Store with a latency of a few minutes. You can configure this ingestion to be near real-time (seconds) if needed, though it has cost implications.
- Synapse Integration:
- Linked Service: Establishing a connection from Synapse to your Cosmos DB account.
- External Table Definition: Specifying the
DATA_SOURCE(linked service),LOCATION(database/container), andFILE_FORMAT(PARQUET). - Synapse Compute: Choosing between Synapse SQL pools (for T-SQL) or Synapse Spark pools (for Spark SQL, Python, Scala, etc.) to run your queries.
- Schema Mapping: While the Analytical Store is schema-agnostic in its raw form, when you define an external table in Synapse, you are essentially defining a schema over the Parquet data. You can choose which properties from your JSON documents to expose and how to type them. Joins and aggregations are performed on this defined schema.
The Counter-Intuitive Detail
The Analytical Store is not a direct copy of your transactional data in a different format; it’s a columnar representation optimized for analytical reads. This means that while the data is eventually consistent with your transactional store, there’s a small ingestion latency. If you query the Analytical Store immediately after a write to the transactional store, you might not see the latest data. The default ingestion time is typically a few minutes, but this can be tuned. Understanding and accounting for this latency is key for use cases requiring near real-time analytics.
The next step is typically exploring how to integrate this data with other data sources in Synapse, such as data lakes (ADLS Gen2) or other databases, for more comprehensive analytical scenarios.