A production readiness checklist is not really about checking boxes. Its purpose is to build confidence, before you push the launch button, that you have thought through the entire lifecycle of your service and not just the happy path.

Let’s walk through a hypothetical launch of a new microservice, "User Profile Service," which handles user data and preferences.

Imagine this is our service’s core logic, written in Go:

package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver
)

type UserProfile struct {
	UserID    string    `json:"user_id"`
	FirstName string    `json:"first_name"`
	LastName  string    `json:"last_name"`
	Email     string    `json:"email"`
	CreatedAt time.Time `json:"created_at"`
}

var db *sql.DB

func main() {
	// Database connection string from environment variable
	dbConnStr := os.Getenv("DATABASE_URL")
	if dbConnStr == "" {
		log.Fatal("DATABASE_URL environment variable not set")
	}

	var err error
	db, err = sql.Open("postgres", dbConnStr)
	if err != nil {
		log.Fatalf("Failed to open database connection: %v", err)
	}
	defer db.Close()

	// Ping the database to ensure connection is valid
	err = db.Ping()
	if err != nil {
		log.Fatalf("Failed to ping database: %v", err)
	}
	log.Println("Database connected successfully.")

	// Configure connection pool
	db.SetMaxOpenConns(25)
	db.SetMaxIdleConns(10)
	db.SetConnMaxLifetime(5 * time.Minute)

	http.HandleFunc("/profile/{user_id}", getUserProfile)
	log.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func getUserProfile(w http.ResponseWriter, r *http.Request) {
	userID := r.PathValue("user_id") // Using Go 1.22+ PathValue

	var profile UserProfile
	query := "SELECT user_id, first_name, last_name, email, created_at FROM user_profiles WHERE user_id = $1"
	err := db.QueryRow(query, userID).Scan(&profile.UserID, &profile.FirstName, &profile.LastName, &profile.Email, &profile.CreatedAt)

	if err != nil {
		if err == sql.ErrNoRows {
			http.NotFound(w, r)
			log.Printf("User not found: %s", userID)
			return
		}
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		log.Printf("Database query error for user %s: %v", userID, err)
		return
	}

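	w.Header().Set("Content-Type", "application/json")
	// Encoding can fail (for example, if the client disconnects), so log it.
	if err := json.NewEncoder(w).Encode(profile); err != nil {
		log.Printf("Failed to encode profile for user %s: %v", userID, err)
	}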
}

This service retrieves user profile data from a PostgreSQL database. When we consider production readiness, we’re not just asking "does it run?" but "can it handle real-world conditions?"

The Core Problem: Unseen Dependencies and Unhandled Failures

At its heart, a production readiness checklist is about surfacing and mitigating risks associated with external dependencies (databases, other services, APIs) and internal failure modes (resource exhaustion, unexpected inputs, network issues). It forces us to ask: "What happens when things don’t go as planned?"

Key Areas to Verify:

  1. Observability: Can you see what’s happening?

    • Logging: Is it structured, at the right levels (INFO, WARN, ERROR), and sent to a central system?
      • Check: Run the service, hit an endpoint, check your log aggregator (e.g., Splunk, ELK, Datadog Logs). Do you see requests, errors, and successful operations?
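      • Fix: If logs are missing, instrument your code so that every critical operation and error is logged. The sample service uses log.Printf, which emits unstructured text; a structured logger such as Go’s standard log/slog package adds machine-parseable fields and levels.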
      • Why: Structured logs allow for easy searching, filtering, and alerting on specific events.
    • Metrics: Are you collecting key performance indicators (KPIs) and system health metrics?
      • Check: Deploy the service with a metrics exporter (e.g., Prometheus client library). Hit an endpoint. Query your monitoring system (e.g., Prometheus, Grafana) for request counts, latency, error rates, and database connection pool usage. Look for metrics such as http_requests_total, http_request_duration_seconds_bucket, and db_connections_open; the exact names depend on how you instrument the service.
      • Fix: Integrate a metrics library. For Go, github.com/prometheus/client_golang is common; its promhttp package provides the handler you expose on a /metrics endpoint.
      • Why: Metrics provide a quantitative view of performance and health, enabling proactive issue detection and capacity planning.
    • Tracing: Can you follow a request across service boundaries?
      • Check: If this service calls other services (or is called by them), ensure distributed tracing (e.g., OpenTelemetry, Jaeger) is enabled. Make a request that spans multiple services and visualize the trace.
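      • Fix: Instrument your service with a tracing SDK. For incoming requests, extract the trace context from the request headers and start a span that continues that trace. For outgoing requests, create child spans and inject the trace context into the outgoing headers.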
      • Why: Tracing is crucial for debugging performance bottlenecks and errors in complex distributed systems.
  2. Dependencies: How do you handle external services?

    • Database Connectivity & Health: Is the database reachable, and are connections managed well?
      • Check: Use docker exec <db_container_id> pg_isready to check PostgreSQL health. In your service logs, monitor for database connection failed or dial tcp ...: connect: connection refused errors. Check db.Ping() success on startup.
      • Fix: Ensure your DATABASE_URL environment variable is correct and the database is accessible from your service’s network. Configure db.SetMaxOpenConns, db.SetMaxIdleConns, and db.SetConnMaxLifetime appropriately based on expected load and database capacity.
      • Why: Robust database connection management prevents resource exhaustion on both the application and database sides and ensures the service can recover from transient network issues.
    • Dependency Timeouts and Retries: What happens when a dependency is slow or unavailable?
      • Check: Introduce artificial latency to the database query (e.g., pg_sleep(5) in a test query). Observe if your service hangs indefinitely or returns a timely error.
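      • Fix: Implement timeouts on database queries by deriving a context with context.WithTimeout and passing it to the context-aware sql.DB methods (QueryRowContext, QueryContext, ExecContext). Consider retry logic with exponential backoff for transient failures, limited to idempotent operations (e.g., using a library like github.com/avast/retry-go).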
      • Why: Timeouts prevent cascading failures by stopping requests that would otherwise wait forever. Retries can help overcome temporary network glitches or overloaded dependencies.
  3. Resilience and Failure Handling: What happens when things break internally?

    • Graceful Shutdown: Does your service clean up resources when it receives a termination signal?
      • Check: Send a SIGTERM signal to your service’s process (kill <pid>). Observe if the database connection is closed (db.Close()) and if ongoing requests are allowed to finish or are cancelled cleanly.
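      • Fix: Use signal.Notify or signal.NotifyContext to catch SIGTERM or SIGINT. When the signal arrives, call Shutdown on an explicit http.Server so it stops accepting new requests and waits for in-flight ones to complete (or for its context to time out), then close resources like the database connection. The sample as written cannot do this: log.Fatal exits the process immediately, so its deferred db.Close() never runs.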
      • Why: Graceful shutdown prevents data corruption and ensures a clean state transition during deployments or restarts.
    • Error Handling: Are all potential error paths handled?
      • Check: Test invalid inputs (e.g., requesting a non-existent user_id). Verify you get a 404 Not Found. Test malformed requests if applicable.
      • Fix: Explicitly check for sql.ErrNoRows and return http.NotFound. For other database errors, return http.StatusInternalServerError and log the detailed error.
      • Why: Comprehensive error handling provides clear feedback to clients and developers, aiding debugging and preventing unexpected behavior.
    • Resource Limits: Are you setting resource constraints (CPU, memory)?
      • Check: If running in Kubernetes, define resources.limits and resources.requests. Use tools like kubectl top pod to see actual usage. If running locally, use docker stats.
      • Fix: Set appropriate requests and limits in your Kubernetes deployment manifests. Start with conservative values and adjust based on observed load.
      • Why: Resource limits prevent a single service from consuming all available resources on a node and starving other services, which keeps performance predictable.
  4. Configuration Management: Is your service configurable without code changes?

    • Environment Variables: Are all external configurations (database URLs, API keys, feature flags) managed via environment variables?
      • Check: Verify that os.Getenv("DATABASE_URL") is used and that the value is correctly passed during deployment.
      • Fix: Refactor any hardcoded configuration values to use environment variables.
      • Why: Environment variables decouple configuration from code, making it easy to adapt the service to different environments (dev, staging, prod) without code changes.
  5. Security: Are you protecting sensitive data and access?

    • Secrets Management: Are sensitive values (API keys, passwords) stored securely and injected as environment variables or mounted as secrets, not hardcoded?
      • Check: Review your deployment configuration (e.g., Kubernetes Secrets, HashiCorp Vault integration). Ensure no secrets appear in your codebase or container images.
      • Fix: Use a secrets management system and inject secrets into the application at runtime.
      • Why: Prevents accidental exposure of sensitive credentials.
    • Input Validation: Are you sanitizing and validating all user-provided input?
      • Check: Try to inject malicious strings or unexpected data types into request parameters.
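      • Fix: Implement robust validation for all incoming request parameters, headers, and body payloads. Keep using parameterized queries (the $1 placeholder in the sample) rather than building SQL by string concatenation.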
      • Why: Protects against injection attacks (SQL injection, XSS) and malformed data causing crashes.
  6. Scalability and Performance: Can it handle the expected load?

    • Load Testing: Have you simulated expected production traffic?
      • Check: Use tools like k6, JMeter, or vegeta to run load tests against your staging environment. Monitor latency, error rates, and resource utilization.
      • Fix: Identify bottlenecks revealed by load tests (e.g., slow database queries, inefficient algorithms, insufficient connection pools) and optimize them.
      • Why: Proactively identifies performance issues before they impact real users.
    • Connection Pooling: Are database and other resource connections pooled effectively?
      • Check: Monitor the number of open database connections using your metrics. Ensure it stays within configured limits and doesn’t grow unbounded.
      • Fix: Properly configure connection pool sizes (db.SetMaxOpenConns, db.SetMaxIdleConns). Ensure connections are always returned to the pool (e.g., by closing *sql.Rows correctly).
      • Why: Reusing connections significantly reduces the overhead of establishing new ones, improving performance and reducing load on the dependency.

The Next Hurdle: Canary Deployments and Rollbacks

Once your checklist is complete and you’ve built that confidence, the next logical step before a full rollout is to master staged deployments like canary releases, ensuring you can rapidly and safely revert if unforeseen issues arise.
