Azure Databricks is a managed Spark analytics service that integrates with Azure, offering a collaborative platform for data engineers, data scientists, and analysts.

Here’s what a typical Databricks workspace looks like in action:

Imagine you’re a data engineer. Your task is to ingest data from a CSV file stored in Azure Data Lake Storage Gen2, transform it, and then load it into an Azure Synapse Analytics data warehouse.

First, you’d create a Databricks workspace in the Azure portal. You choose a region, a resource group, and a name. You also select a pricing tier – Standard, Premium, or Trial. For this example, let’s say you choose Premium.

Once the workspace is provisioned, you’ll access it through the Azure portal. Inside Databricks, you’ll see a clean, web-based interface.

To start, you need a cluster – Databricks’ term for a Spark environment. You navigate to the "Compute" section and click "Create cluster."

You’d configure it like this:

  • Cluster name: my-ingestion-cluster
  • Cluster mode: Standard
  • Databricks runtime version: 13.3 LTS (Scala 2.12, Spark 3.4.1). LTS means Long Term Support, which makes it a good fit for production.
  • Node type: Standard_DS3_v2 (a general-purpose VM).
  • Workers: You’d set a minimum of 2 and a maximum of 8 for autoscaling. This means if your job gets heavy, Databricks will automatically add more worker nodes up to 8. If it’s light, it scales down to 2.
  • Auto termination: 120 minutes. This is crucial for cost savings. If the cluster is idle for 120 minutes, it shuts down automatically.

After clicking "Create cluster," it will start up. This might take a few minutes.
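
The same settings can also be written down as data instead of being clicked through. Here is a minimal sketch of a request body in the shape the Databricks Clusters API uses. The values come from the list above; the check_cluster_spec helper is hypothetical and only illustrates the kind of sanity check you might run before submitting such a spec.

```python
import json

# Cluster settings from the walkthrough, shaped like a Clusters API request body.
cluster_spec = {
    "cluster_name": "my-ingestion-cluster",
    "spark_version": "13.3.x-scala2.12",  # runtime key for 13.3 LTS
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,
}


def check_cluster_spec(spec):
    """Hypothetical helper: return a list of problems found in a cluster spec."""
    problems = []
    scale = spec.get("autoscale", {})
    if scale.get("min_workers", 0) > scale.get("max_workers", 0):
        problems.append("min_workers exceeds max_workers")
    if spec.get("autotermination_minutes", 0) <= 0:
        problems.append("auto termination is disabled")
    return problems


problems = check_cluster_spec(cluster_spec)
payload = json.dumps(cluster_spec, indent=2)
print(payload if not problems else problems)
```

Keeping the spec in a file like this makes it easy to review the autoscaling bounds and the auto termination value alongside the rest of your code.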

Now, you’ll create a notebook. Notebooks are where you write and run your Spark code. You navigate to "Workspace," click the down arrow next to your username, and select "Create" -> "Notebook."

You’d name it data-ingestion-transform and set the language to Python. You then attach this notebook to your my-ingestion-cluster.

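Inside the notebook, you’d write Python code using the Spark API. The spark session and the dbutils helper are predefined in Databricks notebooks, so no imports are needed: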

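# Mount ADLS Gen2 to access files
storage_account_name = "yourdatalakeaccount"
container_name = "yourcontainer"
mount_point = "/mnt/adlsgen2"

# ADLS Gen2 uses the abfss:// scheme with the dfs.core.windows.net endpoint
# (wasbs:// pairs with blob.core.windows.net instead).
# Placeholder key for a quick test only; read it from a secret scope in production.
storage_account_key = "YOUR_STORAGE_ACCOUNT_KEY"

# Mount only if the mount point is not already in use; mounting it twice raises an error
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
    mount_point = mount_point,
    extra_configs = {f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": storage_account_key}
  )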

# Read CSV from ADLS Gen2
csv_file_path = f"{mount_point}/input/sales_data.csv"
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load(csv_file_path)

# Basic transformation: filter sales greater than 1000
transformed_df = df.filter(df["Sales"] > 1000)

# Show first 5 rows of transformed data
transformed_df.show(5)

# Write to Azure Synapse Analytics (dedicated SQL pool)
synapse_endpoint = "your-synapse-workspace.sql.azuresynapse.net"
synapse_database = "your_dw_database"
synapse_table = "transformed_sales"
synapse_user = "your_dw_user"
synapse_password = "YOUR_DW_PASSWORD" # In production, use Azure Key Vault for secrets

# The "com.microsoft.sqlserver.jdbc.spark" format comes from the Apache Spark connector
# for SQL Server and Azure SQL, which must be installed on the cluster as a library.
# Credentials are passed as writer options.
transformed_df.write.format("com.microsoft.sqlserver.jdbc.spark") \
  .mode("overwrite") \
  .option("url", f"jdbc:sqlserver://{synapse_endpoint};database={synapse_database}") \
  .option("dbtable", synapse_table) \
  .option("user", synapse_user) \
  .option("password", synapse_password) \
  .save()

print(f"Data successfully written to Synapse table: {synapse_table}")

This code demonstrates a common pattern: connecting to storage, reading data, performing transformations, and writing to a destination. The dbutils.fs.mount command is a Databricks utility that exposes external storage under a DBFS path such as /mnt/adlsgen2, so it can be read like a local file system. spark.read.format("csv") is part of the core Spark DataFrame API; the write goes through the same API's DataFrameWriter, but the "com.microsoft.sqlserver.jdbc.spark" format is supplied by a separate connector library rather than by Spark itself.
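
It can help to see how a storage location is spelled out when you address ADLS Gen2 directly rather than through a mount point. The sketch below composes such a URI from its parts. The abfss_uri function is a hypothetical helper, not a Databricks API, and the account and container names are the same placeholders used in the notebook.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Hypothetical helper: compose an abfss:// URI for a location in ADLS Gen2."""
    # A leading slash on the path would produce a double slash, so strip it.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Root of the container, as used for the source of a mount.
root_uri = abfss_uri("yourcontainer", "yourdatalakeaccount")

# Full URI of the CSV file read in the walkthrough.
csv_uri = abfss_uri("yourcontainer", "yourdatalakeaccount", "/input/sales_data.csv")

print(root_uri)
print(csv_uri)
```

A URI built this way could be passed to spark.read in place of the mounted path, provided the cluster has been given credentials for the storage account.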

The power of Databricks lies in its unified interface and managed Spark infrastructure, abstracting away the complexities of cluster management and allowing you to focus on data processing. You control compute resources (cluster size, node types, autoscaling), runtime versions (Spark, Scala, Python), and data access credentials.

A key detail often overlooked is how Databricks handles secrets. While you can hardcode credentials for quick tests, production environments should use a secret scope backed by Azure Key Vault. You store your ADLS Gen2 access key or Synapse password in Key Vault, then read it inside the notebook with dbutils.secrets.get(scope="your_keyvault_scope", key="your_secret_name"). This keeps sensitive values out of your code, and Databricks redacts them if they are printed in notebook output.
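
As a minimal sketch of that pattern, the helper below collects several secrets from one scope into a dictionary. The read_secrets function and the secret names adls-key and synapse-password are hypothetical. Because dbutils only exists inside Databricks, the example passes it in as a parameter and uses a small stand-in object so the code can also run locally; in a notebook you would pass the real dbutils.

```python
def read_secrets(dbutils, scope, keys):
    """Hypothetical helper: fetch several secrets from one scope into a dict."""
    return {key: dbutils.secrets.get(scope=scope, key=key) for key in keys}


class _FakeSecrets:
    """Local stand-in that mimics the dbutils.secrets.get(scope=..., key=...) call."""

    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        return self._store[(scope, key)]


class _FakeDbutils:
    """Local stand-in for dbutils; not needed inside a Databricks notebook."""

    def __init__(self, store):
        self.secrets = _FakeSecrets(store)


# Dummy values for a local run; real values would live in Azure Key Vault.
fake_dbutils = _FakeDbutils({
    ("your_keyvault_scope", "adls-key"): "dummy-storage-key",
    ("your_keyvault_scope", "synapse-password"): "dummy-password",
})

creds = read_secrets(fake_dbutils, "your_keyvault_scope", ["adls-key", "synapse-password"])
print(sorted(creds))  # print only the names, never the secret values
```

The returned values can then replace the hardcoded placeholders, for example as the storage account key or the Synapse password.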

The next step after ingesting and transforming data is often scheduling these jobs to run automatically.
