Django’s migrate command, when faced with data migrations on large tables, can bring your application to a grinding halt. This isn’t just about locking; it’s about the sheer volume of data being processed in a single transaction, overwhelming your database and your users.
Let’s see this in action. Imagine a simple UpdateStatus migration on a customers table with millions of rows:
# myapp/migrations/0002_update_customer_status.py
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('myapp', '0001_initial'),
]
operations = [
migrations.RunPython(update_customer_status_data),
]
def update_customer_status_data(apps, schema_editor):
Customer = apps.get_model('myapp', 'Customer')
# This is the problematic line for large tables
for customer in Customer.objects.all():
if customer.last_login < timezone.now() - timezone.timedelta(days=365):
customer.status = 'inactive'
customer.save()
If you run python manage.py migrate, and your customers table has 10 million rows, the Customer.objects.all() will attempt to load all of them into memory. Even if it doesn’t fully load into memory (thanks to Django’s queryset optimizations), the database will likely perform a full table scan and lock the table for the duration of the update_customer_status_data function. This can take hours, during which your application can’t read or write to the customers table.
The core problem is that Django’s migrate command, by default, executes all RunPython operations within a single database transaction. For data migrations on large datasets, this means:
- Massive Transaction Size: The database has to keep track of all changes for the entire duration of the migration. This consumes significant transaction log space and memory.
- Table Locks: To ensure data integrity during the migration, the database often acquires locks on the affected tables. For a full table scan and update, this can be an exclusive lock, blocking all other read/write operations.
- Long Operation Time: Processing millions of rows takes time. If this is done in a single, long-running transaction, your application is effectively down for the duration.
To overcome this, we need to break down the migration into smaller, manageable chunks. The key is to process data in batches and commit each batch separately, thus releasing locks and reducing the transaction log size.
Here’s how you can refactor the update_customer_status_data function to handle this gracefully:
# myapp/migrations/0002_update_customer_status.py
from django.db import migrations, models
from django.utils import timezone
class Migration(migrations.Migration):
dependencies = [
('myapp', '0001_initial'),
]
operations = [
migrations.RunPython(update_customer_status_data_batched),
]
def update_customer_status_data_batched(apps, schema_editor):
Customer = apps.get_model('myapp', 'Customer')
batch_size = 1000 # Process 1000 records at a time
updated_count = 0
# Get the cutoff date once
cutoff_date = timezone.now() - timezone.timedelta(days=365)
# We need to iterate using .iterator() to avoid loading all objects into memory
# and to ensure we process distinct batches.
# We also use .filter() to narrow down the scope to only records that need updating.
# The .update() method is generally faster for bulk updates than individual saves,
# but for complex logic or when needing to trigger signals, individual saves are needed.
# For this example, we'll stick to individual saves within the loop for clarity on batching.
# To ensure we process in batches and don't re-process already updated records if the migration
# is interrupted and resumed, we can use a "processed" flag or filter by the condition.
# Filtering by the condition is cleaner.
# Fetch IDs in batches, then update those specific objects.
# This approach avoids loading the entire object and focuses on IDs.
# A common pattern is to use .iterator() and then process in batches.
# However, for data migrations that need to be idempotent and resumable,
# it's safer to filter for *unprocessed* records.
# If the status is already 'inactive', we don't need to touch it.
customers_to_update = Customer.objects.filter(
status__isnull=True # Or whatever your initial state is before update
).order_by('pk') # Order by primary key for consistent batching
# Use .iterator() for memory efficiency and to work with a live queryset
for customer in customers_to_update.iterator(chunk_size=batch_size):
if customer.last_login < cutoff_date:
customer.status = 'inactive'
customer.save() # Save individually within the loop
updated_count += 1
# This is where the magic happens: committing after each batch
if updated_count % batch_size == 0:
schema_editor.connection.commit() # Commit the current transaction
print(f"Processed and committed {updated_count} records...")
# Final commit for any remaining records
schema_editor.connection.commit()
print(f"Finished data migration. Total records updated: {updated_count}")
Let’s break down the fixes:
batch_size = 1000: We define a small, manageable number of records to process at once. 1000 is a common starting point, but you might tune this based on your database performance and row size.customers_to_update = Customer.objects.filter(...).order_by('pk'): We first filter down to only the records that actually need updating. This is crucial for efficiency. Ordering bypkensures that if the migration is interrupted, it can pick up where it left off without reprocessing records..iterator(chunk_size=batch_size): This is the workhorse. It tells Django to fetch records from the database in chunks ofbatch_size, rather than trying to load everything into memory at once. This drastically reduces memory consumption.customer.save(): We save each record individually. Whilebulk_updateis faster, individualsave()calls allow us to trigger model signals if necessary and are easier to integrate with per-batch commits.if updated_count % batch_size == 0: schema_editor.connection.commit(): This is the most critical part for preventing long-running transactions. After processingbatch_sizerecords, we explicitly commit the transaction. This frees up database resources, releases locks, and makes the changes visible.- Final
schema_editor.connection.commit(): Ensures any remaining records processed after the last full batch are also committed.
A common pitfall here is forgetting to filter for records that need updating. If you iterate over all records and then check the condition inside the loop, you’re still performing a full table scan and potentially updating records unnecessarily, which is slow. Filtering upfront (Customer.objects.filter(...)) is key.
Another subtle point is that schema_editor.connection.commit() might not be available or might behave differently across all database backends supported by Django. For PostgreSQL, this is the standard way. For other databases, you might need to use transaction.atomic() or other specific methods, but schema_editor.connection.commit() is the most portable within RunPython.
The next error you’ll likely encounter is related to django.db.utils.OperationalError: too many connections for role "..." if your batch size is still too large or your commit frequency too low, as each connection might hold resources for longer than anticipated.