An Argo Workflow usually fails because one of its steps hits an error it cannot recover from, which prevents the whole workflow from completing. Each step runs in its own pod, so most of the diagnosis below happens at the pod level.

Common Causes and Fixes

  1. Container Image Pull Failure: The Kubernetes cluster cannot pull the container image specified for your step. This is one of the most common causes.

    • Diagnosis: Check the pod status for your failed step. Look for ImagePullBackOff or ErrImagePull in the container state and in the Events section of the output.
      kubectl describe pod <pod-name> -n <namespace>
      
    • Fix:
      • Incorrect Image Name/Tag: Verify the image name and tag in your workflow definition. Ensure it exactly matches the image in your registry.
        # Workflow Definition Snippet
        container:
          image: "my-docker-registry/my-app:v1.2.3" # Double-check this
        
      • Private Registry Authentication: If using a private registry, create an image pull secret in the workflow’s namespace and reference it from the workflow. imagePullSecrets is a pod-level field, so in an Argo Workflow it belongs under spec, not under container.
        # Create a secret for your registry credentials
        kubectl create secret docker-registry my-registry-secret \
          --docker-server=<your-registry-server> \
          --docker-username=<your-username> \
          --docker-password=<your-password> \
          --docker-email=<your-email> \
          -n <namespace>
        
        # Reference it at the workflow spec level
        spec:
          imagePullSecrets:
          - name: my-registry-secret
          templates:
          - name: my-step # Placeholder template name
            container:
              image: "my-private-registry/my-app:latest"
              imagePullPolicy: Always
        
      • Registry Unreachable: The cluster nodes cannot reach the image registry. Check network policies, firewall rules, or DNS resolution.
    • Why it works: Kubernetes needs to download the container image to run your step. If it can’t find or access the image, it can’t start the container. Correcting the image reference or providing valid credentials/network access allows the pull to succeed.
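
The image pull diagnosis can also be scripted. The following is a minimal Python sketch, assuming you have already loaded the output of kubectl get pod <pod-name> -n <namespace> -o json into a dict. The helper name image_pull_failures is hypothetical; the status fields it reads are the standard Pod status layout.

```python
from __future__ import annotations

# Waiting reasons the kubelet reports when it cannot pull an image.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull", "InvalidImageName"}


def image_pull_failures(pod: dict) -> list[tuple[str, str, str]]:
    """Return (container, image, reason) for each container stuck on an image pull."""
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    failures = []
    for cs in statuses:
        # "waiting" is absent (or null) when the container is running or terminated.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in PULL_FAILURE_REASONS:
            failures.append((cs["name"], cs.get("image", ""), waiting["reason"]))
    return failures
```

An empty list means the pod is not blocked on an image pull, so the cause is more likely one of the items below.
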
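  2. Resource Limits Exceeded (CPU/Memory): The container for your step either requests more CPU or memory than any node can offer, so the pod stays Pending, or it uses more memory than its limit allows and is killed.

    • Diagnosis: Check the container’s state in the describe output for OOMKilled (out of memory, typically exit code 137), and check the events for FailedScheduling with "Insufficient cpu" or "Insufficient memory". To see live usage, run kubectl top while the pod is still running; it requires the metrics-server add-on.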
      kubectl describe pod <pod-name> -n <namespace>
      kubectl top pod <pod-name> -n <namespace> --containers
      
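    • Fix: For OOMKilled containers, raise resources.limits.memory (and resources.requests.memory along with it) in your workflow’s container spec. For pods stuck in Pending, lower the requests or add node capacity. Exceeding a CPU limit only throttles the container; it does not kill it.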
      # Workflow Definition Snippet
      container:
        image: "my-app:latest"
        resources:
          requests:
            cpu: "500m"    # e.g., 0.5 CPU core
            memory: "1Gi"  # e.g., 1 Gigabyte of RAM
          limits:
            cpu: "1000m"   # e.g., 1 CPU core
            memory: "2Gi"  # e.g., 2 Gigabytes of RAM
      
    • Why it works: A container that exceeds its memory limit is killed by the kernel’s OOM killer, and a pod whose requests fit on no node is never scheduled. Sizing requests and limits to the step’s actual usage avoids both.
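
To confirm an out-of-memory kill and compare it against the configured limit, a small Python sketch can help. It assumes the pod JSON from kubectl get pod -o json is already loaded as a dict. Both helper names are hypothetical, and parse_memory handles only the common quantity suffixes, not the full Kubernetes quantity grammar.

```python
from __future__ import annotations

# Common Kubernetes memory suffixes (binary and decimal); a deliberate subset.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '2G' to bytes."""
    # Try two-letter suffixes first so 'Gi' is never mistaken for 'G'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # A bare number is already in bytes.


def oom_killed_containers(pod: dict) -> list[str]:
    """Return the names of containers whose current or last state is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names
```

If main shows up in the result, compare parse_memory of its current limit with the peak usage you observed before choosing a new value.
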
  3. Application Error within the Container: The application running inside your container crashed due to a bug, misconfiguration, or unhandled exception.

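    • Diagnosis: View the logs of the failed pod. In an Argo Workflows pod, your code runs in the container named main, alongside Argo’s own init and wait containers.
      kubectl logs <pod-name> -c main -n <namespace>
      argo logs <workflow-name> -n <namespace>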
      
      Look for stack traces, error messages, or exit codes from your application.
    • Fix: Debug and fix the application code or configuration. This is specific to your application.
      • Example: If your Python script fails with FileNotFoundError, ensure the file is present in the container image or mounted correctly.
      • Example: If a database connection fails, verify connection strings, credentials, and network access within the container.
    • Why it works: Fixing the bug or misconfiguration removes the root cause of the crash, so the step can run to completion.
  4. Command Execution Failure: The command or args specified for the container in your workflow definition exited with a non-zero status code.

    • Diagnosis: Check the container logs for the command’s output and any error messages.
      kubectl logs <pod-name> -c main -n <namespace>
      
      The exit code appears under State or Last State for the container in the output of kubectl describe pod.
    • Fix: Correct the command or arguments. Ensure the executable exists in the container’s PATH or provide the full path.
      # Workflow Definition Snippet
      container:
        image: "ubuntu:latest"
        command: ["/bin/bash", "-c"]
        args:
        # Fail with a clear message and a non-zero exit code if the input file is missing
        - "if [ ! -f /app/data.txt ]; then echo 'Error: data.txt not found' >&2; exit 1; fi && cat /app/data.txt"
      
    • Why it works: A non-zero exit code marks the container as failed, and Argo in turn marks the step as failed. Fixing the command so that it succeeds, or so that it exits with 0 on non-critical issues, lets the step complete.
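
Exit codes carry more information than pass or fail. The Python sketch below reads the exit code of the main container from pod JSON (assumed to be loaded already from kubectl get pod -o json) and translates it using the usual shell conventions: 126 and 127 for command problems, 128 plus the signal number for signal deaths. The helper names and the small signal table are illustrative, not part of any Argo API.

```python
from __future__ import annotations

# Linux signal numbers most often seen in container exits.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def main_exit_code(pod: dict) -> int | None:
    """Return the exit code of the container named 'main', or None if it has not terminated."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if cs.get("name") != "main":
            continue
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated")
            if terminated:
                return terminated.get("exitCode")
    return None


def explain_exit_code(code: int) -> str:
    """Translate an exit code into a short hint based on shell conventions."""
    if code == 0:
        return "success"
    if code == 126:
        return "command found but not executable (check file permissions)"
    if code == 127:
        return "command not found (check PATH or use the full path)"
    if 128 < code <= 192:
        sig = code - 128
        return f"terminated by {SIGNAL_NAMES.get(sig, f'signal {sig}')}"
    return "application-defined error"
```

A result of 127 points at the command or PATH issue described above, while 137 (SIGKILL) more often points back at the memory limit.
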
  5. Volume Mount Issues: The persistent volume (PV), ConfigMap, or Secret that your step needs is not mounted correctly or is unavailable.

    • Diagnosis: Check pod events for FailedMount or MountVolume.SetUp errors.
      kubectl describe pod <pod-name> -n <namespace>
      
      Verify the volumeMounts and volumes definitions in your workflow.
    • Fix:
      • PV/PVC Not Bound: Ensure the PersistentVolumeClaim (PVC) referenced in your volume definition has status Bound. PersistentVolumes are cluster-scoped, so the second command takes no namespace.
        kubectl get pvc <pvc-name> -n <namespace>
        kubectl get pv <pv-name>
        
      • ConfigMap/Secret Missing: Verify that the ConfigMap or Secret exists in the same namespace as the workflow.
        kubectl get configmap <configmap-name> -n <namespace>
        kubectl get secret <secret-name> -n <namespace>
        
      • Incorrect Paths: Double-check the mountPath in volumeMounts and subPath if used.
    • Why it works: Kubernetes needs to attach storage or configuration data to the pod’s filesystem. Correctly configuring and ensuring the availability of these volumes allows the container to access necessary files and settings.
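
A common source of mount errors is a volumeMounts entry whose name matches no declared volume. The Python sketch below checks a workflow manifest for that mismatch. It assumes the manifest has already been parsed into a dict (for example with a YAML library) and looks at spec.volumes, spec.volumeClaimTemplates, and per-template volumes. The function name is hypothetical, and the check does not verify that the referenced PVC, ConfigMap, or Secret actually exists in the cluster.

```python
from __future__ import annotations


def unmatched_volume_mounts(workflow: dict) -> list[tuple[str, str]]:
    """Return (template, mount name) pairs that reference an undeclared volume."""
    spec = workflow.get("spec", {})
    # Volumes visible to every template in the workflow.
    declared = {v["name"] for v in spec.get("volumes", [])}
    declared |= {c["metadata"]["name"] for c in spec.get("volumeClaimTemplates", [])}

    missing = []
    for tmpl in spec.get("templates", []):
        # A template may also declare volumes of its own.
        local = declared | {v["name"] for v in tmpl.get("volumes", [])}
        for kind in ("container", "script"):
            for mount in (tmpl.get(kind) or {}).get("volumeMounts", []):
                if mount["name"] not in local:
                    missing.append((tmpl["name"], mount["name"]))
    return missing
```

Any pair it returns names a template and the mount to fix before submitting the workflow.
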
  6. Network Policy Blocking: Network policies within your Kubernetes cluster are preventing the pod from reaching external services (like databases, APIs) or even other internal services.

    • Diagnosis: This is harder to diagnose directly from pod events. You’ll often see timeouts in your application logs. Check if network policies are applied in the namespace.
      kubectl get networkpolicy -n <namespace>
      
      If policies exist, review them to ensure they allow egress traffic to the required destinations.
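    • Fix: Adjust network policies to permit the necessary connections. Policies are additive: a pod selected by any Egress policy can reach only the destinations that at least one policy allows. The example below allows egress to a database on port 5432 and keeps DNS lookups working. If it is the only policy in the namespace, all other egress from the selected pods is blocked.
      # Example NetworkPolicy
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-db-access
        namespace: <namespace>
      spec:
        podSelector: {} # Applies to all pods in the namespace; narrow this if other pods need different egress
        policyTypes:
        - Egress
        egress:
        - to:
          - ipBlock:
              cidr: 10.0.0.0/24 # Replace with your DB's IP range
          ports:
          - protocol: TCP
            port: 5432
        - ports: # Keep DNS resolution working
          - protocol: UDP
            port: 53
          - protocol: TCP
            port: 53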
      
    • Why it works: Network policies enforce network segmentation. By explicitly allowing traffic to external or internal services, you remove network-level blocks that prevent your step from communicating and completing its task.
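
When reviewing a policy by hand, it helps to test a specific destination against its egress rules. The Python sketch below does this for ipBlock peers only, assuming the policy has already been parsed into a dict. The function names are hypothetical. Peers defined by podSelector or namespaceSelector, named ports, and endPort ranges are not evaluated, so treat a False result as "not allowed by an ipBlock rule" rather than a definitive answer.

```python
from __future__ import annotations

import ipaddress


def _ip_block_matches(block: dict | None, ip) -> bool:
    """True if ip falls inside the block's cidr and outside every 'except' range."""
    if not block:
        return False
    if ip not in ipaddress.ip_network(block["cidr"]):
        return False
    return all(ip not in ipaddress.ip_network(ex) for ex in block.get("except", []))


def egress_allows(policy: dict, dest_ip: str, port: int, protocol: str = "TCP") -> bool:
    """Check whether any egress rule in a single NetworkPolicy allows dest_ip:port."""
    ip = ipaddress.ip_address(dest_ip)
    for rule in policy.get("spec", {}).get("egress", []):
        peers = rule.get("to", [])
        ports = rule.get("ports", [])
        # An omitted 'to' or 'ports' list matches every destination or port.
        peer_ok = not peers or any(_ip_block_matches(p.get("ipBlock"), ip) for p in peers)
        port_ok = not ports or any(
            p.get("protocol", "TCP") == protocol and p.get("port") in (None, port)
            for p in ports
        )
        if peer_ok and port_ok:
            return True
    return False
```

Because policies are additive, a destination is reachable if this returns True for any one of the policies that select the pod.
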

Once the step itself succeeds, the next failure you may run into is a deadline-exceeded error. This happens when the workflow sets a timeout (activeDeadlineSeconds) and the fixes above resolved the immediate step failure without addressing an underlying performance problem or a slow dependency.
