ML Model Versioning Tools Compared

MLflow and the Hugging Face Hub are easy to mistake for deployment tools, but they are just as useful for tracking a model’s entire lifecycle, from training data to final artifact, which makes reproducibility a first-class concern.

Let’s see this in action. Imagine you’re training a sentiment analysis model using Hugging Face’s transformers library. For every experiment, you’ll want to log not just the final model weights but also the hyperparameters, the dataset version, and any metrics you care about.

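First, install the necessary libraries. The Trainer used below needs PyTorch and accelerate, which the torch extra pulls in, and the accuracy and F1 metrics from evaluate rely on scikit-learn:

pip install mlflow "transformers[torch]" datasets evaluate scikit-learn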

Now, let’s set up an MLflow experiment. You’d typically do this in your Python script:

import mlflow
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Set the MLflow tracking URI: a running tracking server, or a local database file
mlflow.set_tracking_uri("http://localhost:5000")  # Or "sqlite:///mlruns.db"

# Start an MLflow run
with mlflow.start_run(run_name="sentiment_analysis_training"):
    # Log hyperparameters
    learning_rate = 2e-5
    num_train_epochs = 3
    batch_size = 16
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("num_train_epochs", num_train_epochs)
    mlflow.log_param("batch_size", batch_size)

    # Load dataset and tokenizer
    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Preprocess dataset (example)
    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

    tokenized_datasets = dataset.map(preprocess_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["text"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")

    # Log dataset version (if applicable, e.g., from a data catalog or specific commit)
    # For simplicity, we'll log the dataset name and split
    mlflow.log_param("dataset_name", "imdb")
    mlflow.log_param("dataset_split", "train")

    # Load model
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    # Define training arguments and trainer (simplified for demonstration)
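    import numpy as np
    from transformers import TrainingArguments, Trainer
    from evaluate import load

    accuracy_metric = load("accuracy")
    f1_metric = load("f1")

    def compute_metrics(eval_pred):
        # Trainer passes (logits, labels); return both metrics that are logged later
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
        f1 = f1_metric.compute(predictions=predictions, references=labels)
        return {**accuracy, **f1}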

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        weight_decay=0.01,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_results = trainer.evaluate()
    mlflow.log_metric("accuracy", eval_results["eval_accuracy"])
    mlflow.log_metric("f1", eval_results["eval_f1"]) # Assuming f1 is logged by compute_metrics

    # Save the model and tokenizer
    model_save_path = "./sentiment_model"
    tokenizer.save_pretrained(model_save_path)
    model.save_pretrained(model_save_path)

    # Log the model artifact with MLflow; the model and tokenizer are passed
    # together as a dictionary of pipeline components
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="sentiment_classifier",
        task="text-classification",
        input_example="This movie was fantastic!",
        registered_model_name="SentimentAnalysisModel"
    )

print("MLflow run completed. Model logged.")

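Because the tracking URI points at http://localhost:5000, a tracking server has to be listening there before the script runs; running mlflow ui in a terminal starts one on that port by default. Once the script finishes, open that address in a browser to find the experiment and its logged run. There you can inspect the logged parameters, the metrics, and, importantly, the saved model artifact.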
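The same information the UI shows can also be read back in code. The following is a minimal sketch, assuming the `client` argument is an `MlflowClient` connected to the tracking server used above; the helper name `summarize_run` is hypothetical and not part of MLflow.

```python
def summarize_run(client, run_id: str) -> dict:
    """Collect the logged parameters and metrics of one run into a plain dict."""
    run = client.get_run(run_id)
    return {
        "run_id": run.info.run_id,
        # MLflow stores parameter values as strings
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
    }


# Usage, assuming a reachable tracking server and a real run ID:
# from mlflow import MlflowClient
# print(summarize_run(MlflowClient(), "your_run_id_here"))
```

Passing the client in, rather than creating it inside the function, keeps the helper independent of where the tracking server lives.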

The mlflow.transformers.log_model call is key here. It saves the model and tokenizer as a run artifact and, because registered_model_name="SentimentAnalysisModel" is passed, also registers the result in MLflow’s Model Registry: MLflow creates a registered model named "SentimentAnalysisModel" if none exists, and otherwise adds a new version to it. This gives you a central place to manage different versions of your model.

When you look at the registered model in the MLflow UI, you’ll see a list of versions. Each version corresponds to one registration, such as a log_model call that passes registered_model_name. You can then move a version through stages: "Staging," "Production," or "Archived." This is crucial for MLOps, because it lets you control which version of a model your applications use. Note that recent MLflow releases deprecate stages in favor of model version aliases, so check which mechanism your installed version supports.
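A stage transition does not have to go through the UI. Here is a minimal sketch of scripting it with `MlflowClient.transition_model_version_stage`; the wrapper name `promote_to_production` is hypothetical, and the version number in the usage comment is only an example.

```python
def promote_to_production(client, model_name: str, version):
    """Move one registered version to Production and archive the version it replaces."""
    # `client` is expected to be an mlflow.MlflowClient whose server supports stages
    return client.transition_model_version_stage(
        name=model_name,
        version=str(version),
        stage="Production",
        archive_existing_versions=True,
    )


# Usage, assuming a reachable tracking server and an existing version 3:
# from mlflow import MlflowClient
# promote_to_production(MlflowClient(), "SentimentAnalysisModel", 3)
```

Setting `archive_existing_versions=True` keeps a single version in "Production" at a time, which keeps "the production model" unambiguous.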

The real power comes when you want to reproduce an experiment or deploy a specific version. You can retrieve a logged model from MLflow using its run ID or, more commonly, by its registered model name and version.

For example, to load the latest production version of your "SentimentAnalysisModel":

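import mlflow
from mlflow import MlflowClient

# Use the same tracking URI as the training script
mlflow.set_tracking_uri("http://localhost:5000")

client = MlflowClient()
model_name = "SentimentAnalysisModel"
latest_production_version = client.get_latest_versions(model_name, stages=["Production"])[0].version

# Load the model with MLflow's transformers flavor (returns a transformers pipeline by default)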
loaded_model = mlflow.transformers.load_model(
    f"models:/{model_name}/{latest_production_version}"
)

# Or load directly from a run artifact
# run_id = "your_run_id_here"
# loaded_model = mlflow.transformers.load_model(f"runs:/{run_id}/sentiment_classifier")

# Now you can use loaded_model for inference
text_to_classify = "This was an amazing experience!"
# Keep the inputs on the same device as the model
inputs = loaded_model.tokenizer(text_to_classify, return_tensors="pt").to(loaded_model.model.device)
outputs = loaded_model.model(**inputs)
prediction = outputs.logits.argmax(-1).item()
print(f"Prediction: {prediction}")
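The model URIs used above follow a small set of patterns. The sketch below builds them with two helpers; the helper names are hypothetical and not part of MLflow, and only the `runs:/` and `models:/` URI schemes come from MLflow itself.

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """URI for a model logged under a specific run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version_or_stage) -> str:
    """URI for a registered model, addressed by version number or by stage name."""
    return f"models:/{name}/{version_or_stage}"


print(run_model_uri("your_run_id_here", "sentiment_classifier"))
print(registry_model_uri("SentimentAnalysisModel", 3))
print(registry_model_uri("SentimentAnalysisModel", "Production"))
```

Addressing a model by stage name rather than by version number means the loading code does not need to change when a new version is promoted.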

This system lets you trace a specific model artifact back to the run that produced it, along with its hyperparameters, metrics, and whatever code and data identifiers were logged, which goes a long way toward solving the "it worked on my machine" problem for machine learning.
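That link is only as precise as what gets logged. The training script records the dataset name and split, which does not pin down the exact contents. One option, sketched here under the assumption that the data lives in local files, is to log a content hash as an extra parameter. The helper name `fingerprint_files` and the file names in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint_files(paths) -> str:
    """Return a SHA-256 hex digest over file names and contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded into memory at once
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Usage inside an active MLflow run (file names are placeholders):
# mlflow.log_param("dataset_sha256", fingerprint_files(["train.csv", "test.csv"]))
```

Sorting the paths makes the digest independent of the order in which files are listed, so two runs over the same files report the same value.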

MLflow’s Model Registry is more than a versioning tool; it also serves as a central governance layer. When you transition a model version to "Production," you are declaring that this specific artifact, tied to a specific run and its lineage, is the one your applications should query. Because each version stays tied to its run, you can roll back to an earlier version and audit what was deployed.
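A rollback then amounts to pointing consumers back at an earlier version. The sketch below uses model version aliases, a newer MLflow mechanism, instead of the stage transitions described in this article. It assumes an MLflow release that provides `set_registered_model_alias` and `get_model_version_by_alias`; the function name `rollback` and the alias "champion" are hypothetical.

```python
def rollback(client, model_name: str, alias: str, previous_version) -> str:
    """Repoint an alias at an earlier version and return the version now served."""
    # `client` is expected to be an mlflow.MlflowClient
    client.set_registered_model_alias(model_name, alias, str(previous_version))
    return client.get_model_version_by_alias(model_name, alias).version


# Usage, assuming version 2 exists and applications load "models:/SentimentAnalysisModel@champion":
# from mlflow import MlflowClient
# rollback(MlflowClient(), "SentimentAnalysisModel", "champion", 2)
```

Because applications resolve the alias at load time, no application code changes when the alias moves.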

The next step is to integrate this registered model into an automated deployment pipeline, perhaps using MLflow’s deployment tools or integrating with CI/CD systems.
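One piece such a pipeline typically needs is a rule for deciding whether a new version should replace the current one. Below is a minimal sketch of such a gate in plain Python. The function name, the metric dictionaries, and the `min_gain` threshold are all hypothetical; in practice the metrics would come from the runs behind each model version.

```python
def should_promote(candidate: dict, production: dict,
                   metric: str = "accuracy", min_gain: float = 0.0) -> bool:
    """Return True when the candidate beats production on `metric` by more than `min_gain`."""
    if metric not in candidate:
        raise KeyError(f"candidate run did not log {metric!r}")
    if metric not in production:
        # Nothing in production to compare against, so let the candidate through
        return True
    return candidate[metric] - production[metric] > min_gain


print(should_promote({"accuracy": 0.93}, {"accuracy": 0.91}))
print(should_promote({"accuracy": 0.93}, {"accuracy": 0.91}, min_gain=0.05))
```

A CI job could run a check like this after training and only then call the registry to promote the new version.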