Fine-tuning is more like a black box than a science, and Weights & Biases is the X-ray machine that lets you see inside.
Let’s watch a model learn. Imagine you’re fine-tuning a BERT model for sentiment analysis. You’ve got your data, your hyperparameters, and your script. You call wandb.init() once at the start and wandb.log() inside your training loop.
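import wandb
import torch

# Initialize W&B
wandb.init(project="sentiment-finetune", entity="your_username")

# Assume model, optimizer, dataloader are defined
model = ...
optimizer = ...
dataloader = ...
num_epochs = 3  # example value
criterion = torch.nn.CrossEntropyLoss()

def calculate_accuracy(outputs, labels):
    # Fraction of examples whose highest-scoring class matches the label
    return (outputs.argmax(dim=-1) == labels).float().mean().item()

for epoch in range(num_epochs):
    for batch in dataloader:
        inputs, labels = batch
        optimizer.zero_grad()
        outputs = model(inputs)  # assumed to return raw class logits
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Log metrics to W&B
        wandb.log({"loss": loss.item(), "accuracy": calculate_accuracy(outputs, labels)})

wandb.finish()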
This simple wandb.log call is where the magic happens. Each time it executes, it sends a dictionary of metrics to your W&B project. These aren’t just numbers; they’re time-series data points that build a rich picture of your training process.
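Because every call adds one point to each metric's time series, logging on every batch can produce far more points than you need on a long run. Here is a minimal sketch of throttled logging. DryRun and log_every are hypothetical names introduced for this example, and DryRun only imitates the .log(data, step=...) method so the code runs without a W&B account. The loss values are made up.

```python
class DryRun:
    """Hypothetical stand-in that mimics the .log(data, step=...) shape of a W&B run."""

    def __init__(self):
        self.history = []  # list of (step, metrics) pairs, in call order

    def log(self, data, step=None):
        self.history.append((step, dict(data)))


def log_every(run, step, metrics, every=10):
    """Log `metrics` at `step` only when `step` is a multiple of `every`.

    Returns True if the metrics were sent, False if they were skipped.
    """
    if every < 1:
        raise ValueError("every must be at least 1")
    if step % every != 0:
        return False
    run.log(metrics, step=step)
    return True


# Simulate 25 training steps with a made-up, steadily decreasing loss.
run = DryRun()
for step in range(25):
    fake_loss = 1.0 / (step + 1)
    log_every(run, step, {"loss": fake_loss}, every=10)

logged_steps = [step for step, _ in run.history]
print(logged_steps)
```

In real training you would pass the object returned by wandb.init() in place of DryRun. If you set step explicitly, keep it increasing from one call to the next.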
Now, what problem does this solve? Without W&B, you’re staring at terminal output, maybe saving checkpoints, and hoping for the best. You can’t easily compare hyperparameters, visualize learning curves across runs, or see exactly when your model started overfitting. Fine-tuning involves many moving parts: learning rate schedules, optimizer states, dataset shuffling, and subtle model architecture changes. Each one can drastically alter the outcome. W&B records the settings and metrics you log for every run, so you can trace how each of these variables affects your model’s performance, run after run.
Internally, W&B acts as a data pipeline. When wandb.log is called, the W&B client serializes your metrics and any associated data (such as gradient and parameter histograms, if you configure it to collect them) and syncs them to the W&B server, which is the hosted cloud service by default. There, they are stored, indexed, and exposed through an interactive dashboard where you can plot metrics, compare runs side by side, filter and group runs, and watch logs stream in while training is still going. You can log anything from scalar metrics like loss and accuracy to richer objects like sample predictions and custom visualizations using HTML or Plotly; larger files such as model checkpoints are typically saved as artifacts.
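As one illustration of logging something richer than a scalar, sample predictions can be arranged as rows before they are handed to W&B. This is a sketch only: build_prediction_rows is a hypothetical helper, and the texts and labels are invented for the example.

```python
def build_prediction_rows(texts, labels, predictions):
    """Arrange sample predictions as (columns, rows) for a table-style log.

    Each row holds the input text, the true label, the predicted label,
    and whether the prediction was correct.
    """
    if not (len(texts) == len(labels) == len(predictions)):
        raise ValueError("texts, labels, and predictions must have the same length")
    columns = ["text", "label", "prediction", "correct"]
    rows = [[t, y, p, y == p] for t, y, p in zip(texts, labels, predictions)]
    return columns, rows


# Invented examples for a sentiment task.
columns, rows = build_prediction_rows(
    texts=["great movie", "terrible plot", "not bad at all"],
    labels=["positive", "negative", "positive"],
    predictions=["positive", "negative", "negative"],
)

# Share of rows whose "correct" flag is False.
error_rate = sum(1 for row in rows if not row[3]) / len(rows)
print(columns)
print(rows)
```

With wandb installed, the result could be wrapped as wandb.Table(columns=columns, data=rows) and passed to wandb.log under a key of your choosing, which makes individual mistakes browsable in the dashboard.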
The exact levers you control are primarily within the wandb.init() call and the wandb.log() dictionary. wandb.init() takes arguments like project, entity, name (for a specific run), config (a dictionary of hyperparameters), and tags. The config dictionary is crucial:
wandb.init(
    project="sentiment-finetune",
    entity="your_username",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 10,
        "model_name": "bert-base-uncased",
        "optimizer": "AdamW"
    }
)
This logs your hyperparameters alongside your metrics, allowing W&B to automatically generate comparison tables and filter runs based on specific settings. The wandb.log() dictionary is your canvas for everything else:
wandb.log({
    "epoch": epoch,
    "train/loss": loss.item(),
    "train/accuracy": accuracy,
    "val/loss": val_loss,
    "val/accuracy": val_accuracy,
    "lr_scheduler_step": lr_step_count
})
By prefixing metrics with train/ or val/, you automatically group them into separate sections in the W&B UI, so training and validation charts stay organized.
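Both levers can be prepared with plain Python before W&B is involved. The sketch below builds a config dictionary from command-line flags and a prefixed metrics dictionary for one epoch. The function names parse_hyperparameters, with_prefix, and epoch_payload are hypothetical, the flag names simply mirror the config keys used in this article, and the metric values are made up.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse command-line flags into a plain dict suitable for wandb.init(config=...)."""
    parser = argparse.ArgumentParser(description="Fine-tuning hyperparameters")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--model_name", default="bert-base-uncased")
    parser.add_argument("--optimizer", default="AdamW")
    return vars(parser.parse_args(argv))


def with_prefix(prefix, metrics):
    """Return a copy of `metrics` with every key rewritten as 'prefix/key'."""
    return {f"{prefix}/{name}": value for name, value in metrics.items()}


def epoch_payload(epoch, train_metrics, val_metrics):
    """Combine one epoch's train and val metrics into a single dict for one log call."""
    payload = {"epoch": epoch}
    payload.update(with_prefix("train", train_metrics))
    payload.update(with_prefix("val", val_metrics))
    return payload


# Explicit argv so the example behaves the same wherever it runs.
config = parse_hyperparameters(["--learning_rate", "2e-5", "--epochs", "3"])

# Made-up metric values for illustration.
payload = epoch_payload(
    epoch=0,
    train_metrics={"loss": 0.52, "accuracy": 0.78},
    val_metrics={"loss": 0.61, "accuracy": 0.74},
)
print(config)
print(payload)
```

In a real script, config would go to wandb.init(config=config) and payload to wandb.log(payload). Sending train and val metrics in the same call keeps them on the same step, which makes the two curves easier to line up.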
One aspect that often surprises people is how much W&B records beyond your model’s performance metrics. System-level monitoring is built in: resource utilization such as GPU, CPU, and memory usage is collected automatically for each run. To look inside the model itself, you can use the wandb.watch() function to automatically log gradients and parameters:
# Log gradients and parameters
wandb.watch(model, log="all", log_freq=100)
This sends a stream of data about your model’s internal state, allowing you to correlate performance dips or spikes with changes in gradient magnitudes or parameter updates. It’s like having a real-time oscilloscope for your neural network, showing you not just what happened to the performance, but how the model’s internal dynamics changed leading up to it. This granular view can be invaluable for debugging complex failure modes that aren’t obvious from just looking at loss curves.
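If you also want a single number to plot next to the loss curve, one option is to compute the overall gradient norm yourself and log it as a scalar. This is a complementary sketch, not a description of what wandb.watch does internally. global_grad_norm is a hypothetical helper that assumes each parameter exposes a .grad attribute supporting .pow(2).sum(), as PyTorch tensors do.

```python
import math


def global_grad_norm(parameters):
    """Return the L2 norm taken over the gradients of all given parameters.

    Parameters whose .grad is None (for example frozen layers) are skipped.
    """
    total = 0.0
    for param in parameters:
        grad = getattr(param, "grad", None)
        if grad is None:
            continue
        # Sum of squared gradient entries for this parameter.
        total += float(grad.pow(2).sum())
    return math.sqrt(total)
```

Called after loss.backward(), as in global_grad_norm(model.parameters()), the result can be added to the wandb.log dictionary under a key such as train/grad_norm. A sudden spike in that curve just before the loss diverges is a common sign that the learning rate is too high.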
The next step in mastering your fine-tuning runs is exploring hyperparameter sweeps.