Microsoft Purview governs your data assets, but it is more than a fancy catalog: it is a system that actively helps you understand and control your data’s journey.
Let’s see Purview in action. Imagine you’ve got data scattered across Azure Data Lake Storage (ADLS) Gen2, SQL Server, and even some flat files in a local share. You need to know who has access to what, where sensitive data resides, and how data flows from source to report.
First, you’d set up a Purview account in your Azure subscription. Then, you’d configure scans to discover your data assets.
```shell
# Example: Scan configuration for ADLS Gen2
az purview scan create \
  --resource-group myPurviewResourceGroup \
  --account-name myPurviewAccount \
  --scan-name myAdlsScan \
  --kind AzureResourceHdfs \
  --connected-store-name myadlsgen2account.dfs.core.windows.net \
  --resource-group-name myDataResourceGroup \
  --subscription-id <your-subscription-id> \
  --authentication-type ManagedIdentity \
  --schedule '{ "recurrence": { "frequency": "Day", "interval": 1 } }'
```
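This command tells Purview to connect to your ADLS Gen2 account using its managed identity and scan it daily. The managed identity needs read access to the data, typically the Storage Blob Data Reader role on the storage account. Treat the exact command and flag names as illustrative and confirm them against the current Azure CLI reference; scans are also commonly configured in the governance portal or through the Purview scanning REST API. During a scan, Purview inspects the files, identifies schemas, and extracts metadata.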
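To make the schedule setting concrete, here is a minimal Python sketch that reads a recurrence object shaped like the one passed to `--schedule` and lists the next few run times. The `next_runs` helper and the frequency-to-step mapping are assumptions made for illustration; Purview's own scheduler is not shown here and may behave differently.

```python
import json
from datetime import datetime, timedelta

# Assumed mapping from frequency names to a time step. Only "Day" appears in
# the scan example; "Hour" and "Week" are illustrative additions.
_STEP = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}

def next_runs(schedule_json: str, start: datetime, count: int = 3) -> list[datetime]:
    """Return the next `count` run times after `start` for a recurrence object."""
    recurrence = json.loads(schedule_json)["recurrence"]
    step = _STEP[recurrence["frequency"]] * recurrence["interval"]
    return [start + step * i for i in range(1, count + 1)]

schedule = '{ "recurrence": { "frequency": "Day", "interval": 1 } }'
for run in next_runs(schedule, datetime(2024, 1, 1, 2, 0)):
    print(run.isoformat())
```

With a frequency of `Day` and an interval of 1, each run lands exactly 24 hours after the previous one.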
Once scanned, you can explore your data assets in the Purview Governance Portal. Here’s what it looks like when you view a specific table in SQL Server:
- Overview: Shows the dataset name, description, owner, and classification labels (e.g., "Credit Card Number," "Personal Information").
- Schema: Lists all columns, their data types, and descriptions.
- Lineage: Visualizes how data flows into this table from other sources and how it’s used by downstream processes or reports. This is often the most powerful part, showing the actual ETL jobs or data pipelines that populated it.
- Glossary: Links to business terms defined in Purview’s glossary, providing business context.
Purview addresses data sprawl and the lack of trust that comes with it. It brings disparate data sources into one place, making them searchable, understandable, and governable. Internally, it relies on these building blocks:
- Registration: You register your data sources (ADLS, SQL, etc.) with Purview.
- Scanning: Purview uses registered credentials to connect to sources, extract metadata (schemas, table names, file paths), and apply classification rules.
- Classification: Using built-in and custom rules, Purview identifies sensitive data types (PII, financial data) within your datasets.
- Lineage Extraction: For supported sources (like Azure Data Factory, SQL Server Integration Services), Purview can trace data movement and transformations, building a visual lineage graph.
- Business Glossary: You define business terms, their definitions, and link them to technical assets, bridging the gap between IT and business users.
- Data Catalog: All this information is indexed, making data assets discoverable through search.
- Access Policies: While Purview itself doesn’t directly manage access for all sources, it integrates with Microsoft Entra ID (formerly Azure Active Directory) and can help enforce data access policies by showing who should have access based on roles and classifications.
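The scanning, classification, and catalog steps in this list can be pictured with a small, self-contained Python sketch. It is not Purview's implementation: the regex rules, the `Column` record, the `classify_column` and `search` helpers, and the 60% match threshold are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical classification rules: label -> regex applied to sampled values.
RULES = {
    "Credit Card Number": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

@dataclass
class Column:
    name: str
    samples: list[str]
    classifications: set[str] = field(default_factory=set)

def classify_column(column: Column, threshold: float = 0.6) -> None:
    """Attach a label when enough sampled values match its rule."""
    if not column.samples:
        return
    for label, pattern in RULES.items():
        hits = sum(1 for value in column.samples if pattern.match(value))
        if hits / len(column.samples) >= threshold:
            column.classifications.add(label)

def search(catalog: dict[str, list[Column]], label: str) -> list[str]:
    """Return 'asset.column' names that carry the given classification."""
    return sorted(
        f"{asset}.{col.name}"
        for asset, cols in catalog.items()
        for col in cols
        if label in col.classifications
    )

# A toy catalog with one scanned table and sampled column values.
catalog = {
    "sales.customers": [
        Column("email", ["a@example.com", "b@example.org", "n/a"]),
        Column("card", ["4111 1111 1111 1111", "5500-0000-0000-0004"]),
    ],
}
for cols in catalog.values():
    for col in cols:
        classify_column(col)

print(search(catalog, "Credit Card Number"))
```

Matching on a sample with a threshold, rather than requiring every value to match, lets a column with a few placeholder values such as `n/a` still be classified.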
The real magic isn’t just cataloging; it’s the interconnectedness. When you classify a column in ADLS Gen2 as "Passport Number," and that data is then copied via Azure Data Factory into a SQL Server table, Purview automatically updates the lineage to show the flow and propagates the "Passport Number" classification to the SQL table’s column. This means your sensitive data tracking is continuous and automated.
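One way to picture this propagation is a walk over a lineage graph. The sketch below is a simplified model, not Purview's internals: the edge list, the column identifiers, and the `propagate` function are hypothetical, and it assumes a classification should follow every downstream copy of a column.

```python
from collections import defaultdict, deque

def propagate(edges, classifications):
    """Copy each column's classifications to every column downstream of it."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    result = {col: set(labels) for col, labels in classifications.items()}
    queue = deque(classifications)
    while queue:
        col = queue.popleft()
        for target in downstream[col]:
            before = len(result.setdefault(target, set()))
            result[target] |= result[col]
            # Re-visit a target only when it gained a label, so cycles terminate.
            if len(result[target]) > before:
                queue.append(target)
    return result

# Hypothetical lineage: ADLS file column -> SQL column -> report field.
edges = [
    ("adls://raw/travelers.csv#passport_no", "sql://dbo.Travelers#PassportNo"),
    ("sql://dbo.Travelers#PassportNo", "powerbi://TravelReport#Passport"),
]
labels = propagate(edges, {"adls://raw/travelers.csv#passport_no": {"Passport Number"}})
print(labels["sql://dbo.Travelers#PassportNo"])
```

In this model the label reaches the report field as well, because propagation continues along every edge until no column gains a new classification.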
Most people think of Purview as just a search engine for data. However, its true strength lies in its ability to infer relationships and propagate metadata automatically. When a data pipeline moves data, Purview doesn’t just record the fact that data moved; it understands the context of that data – its schema, its classifications, and its business meaning – and ensures that context travels with the data, even across different technologies. This automated metadata propagation is what enables true data governance at scale.
The next step after understanding your data assets is defining and enforcing data access controls based on these discovered classifications and ownership.