Vault’s ML Credentials feature, designed to securely manage API keys for machine learning models, is leaking them because the credential rotation mechanism is failing to properly revoke old credentials upon generating new ones.

Common Causes and Fixes:

  1. Incorrect lease_duration Configuration:

    • Diagnosis: Check the lease duration that applies to credentials issued for the role. This is the longest a credential stays valid before Vault revokes it. If it is too long, old credentials persist long after rotation. The commands below assume the AWS secrets engine is mounted at aws/.
      vault read aws/roles/my-ml-role
      vault read sys/mounts/aws/tune
      
      On the role, look for credential_type and, for STS credential types, default_sts_ttl and max_sts_ttl. On the mount, look for default_lease_ttl and max_lease_ttl. The lease_duration and renewable values themselves are returned with each credential issued by vault read aws/creds/my-ml-role.
    • Fix: Keep the effective lease duration short (e.g., 30m for 30 minutes). For iam_user credentials the duration comes from the mount’s TTL settings, not from a role parameter. If rotation is supposed to shorten leases, make sure the rotation script issues a new credential under the shorter TTL rather than renewing the old lease.
      vault write aws/roles/my-ml-role \
          credential_type=iam_user \
          policy_arns=arn:aws:iam::123456789012:policy/MyMLPolicy
      vault secrets tune -default-lease-ttl=30m aws/
      
    • Why it works: A shorter lease duration limits the exposure window of any leaked credential. A renewable lease can still be extended up to the maximum TTL, so a correctly implemented rotation generates a new credential with a new lease and then revokes the old one, rather than renewing the old one.
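To see which existing leases already exceed the TTL you intend to enforce, a script can walk the leases under the role's credential path. The following is a minimal sketch, assuming `client` is an authenticated hvac.Client whose token is allowed to list and look up leases. The function name, the aws/creds/my-ml-role prefix, and the 30-minute threshold are illustrative, not part of the original setup.

```python
def find_long_lived_leases(client, prefix, max_ttl_seconds):
    """Return (lease_id, ttl) pairs under `prefix` whose remaining TTL exceeds the limit.

    `client` is assumed to behave like hvac.Client: sys.list_leases() returns the
    lease keys under a prefix and sys.read_lease() returns lease metadata.
    """
    listing = client.sys.list_leases(prefix=prefix)
    offenders = []
    for key in listing["data"]["keys"]:
        if key.endswith("/"):
            # Nested prefix rather than a lease; skipped in this sketch.
            continue
        lease_id = f"{prefix.rstrip('/')}/{key}"
        lease = client.sys.read_lease(lease_id=lease_id)
        ttl = lease["data"]["ttl"]  # remaining seconds
        if ttl > max_ttl_seconds:
            offenders.append((lease_id, ttl))
    return offenders


# Usage (requires a reachable Vault and the hvac package):
#   import hvac
#   client = hvac.Client(url="http://127.0.0.1:8200", token="...")
#   print(find_long_lived_leases(client, "aws/creds/my-ml-role", 30 * 60))
```

Any lease this reports was issued under a longer TTL than intended and is a candidate for explicit revocation.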
  2. Improper Revocation in Custom Rotation Scripts:

    • Diagnosis: If you’re using a custom script to rotate credentials (e.g., one that calls vault read aws/creds/my-ml-role), inspect the script for an explicit revocation call. Does it call vault lease revoke with the lease ID of the old credential?
    • Fix: After generating a new credential, explicitly revoke the old one using its lease ID. The response to a credential request contains only the lease ID of the newly generated credential, so the script has to store it and revoke it on the next rotation.
      # Example snippet in a Python rotation script
      import os
      import hvac
      
      client = hvac.Client(url='http://127.0.0.1:8200', token=os.environ['VAULT_TOKEN'])
      
      # Lease ID of the credential to be revoked, saved by the previous rotation.
      # Reading it from OLD_LEASE_ID is illustrative; use whatever store your script keeps.
      old_lease_id = os.environ['OLD_LEASE_ID']
      
      # Generate the new credential first so the workload is never left without a key
      new_cred = client.secrets.aws.generate_credentials(name='my-ml-role')
      print(f"New credential issued with lease ID: {new_cred['lease_id']}")
      
      # ... hand new_cred['data'] to the ML service, then revoke the old lease ...
      try:
          client.sys.revoke_lease(lease_id=old_lease_id)
          print(f"Successfully revoked old credential with lease ID: {old_lease_id}")
      except Exception as e:
          print(f"Error revoking old credential: {e}")
      
    • Why it works: Vault doesn’t revoke an earlier credential when a new one is generated for the same role. Each credential keeps its own lease until that lease expires or is revoked. Unless the script revokes it explicitly, the old, potentially leaked, credential stays active for the rest of its lease.
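Tracking the previous lease ID is the part rotation scripts most often get wrong. One way to handle it is sketched below, assuming `client` behaves like an authenticated hvac.Client. The JSON state file, the `deploy` callback, and the function name are hypothetical choices made for illustration.

```python
import json
from pathlib import Path


def rotate(client, role, state_file, deploy):
    """Issue a new credential for `role`, hand it to `deploy`, then revoke the old lease.

    `state_file` stores the lease ID of the credential currently in use so that the
    next run knows what to revoke. `deploy` is a callable that pushes the new
    credential to its consumers.
    """
    state_path = Path(state_file)
    previous = json.loads(state_path.read_text()) if state_path.exists() else {}
    old_lease_id = previous.get("lease_id")

    # 1. Generate the replacement first so consumers always have a valid key.
    new_cred = client.secrets.aws.generate_credentials(name=role)

    # 2. Record the new lease ID before anything else can fail.
    state_path.write_text(json.dumps({"lease_id": new_cred["lease_id"]}))

    # 3. Switch consumers over to the new credential.
    deploy(new_cred["data"])

    # 4. Only now revoke the old lease.
    if old_lease_id:
        client.sys.revoke_lease(lease_id=old_lease_id)

    return new_cred["lease_id"]
```

Writing the state file before revoking means a crash between steps leaves an extra credential alive, never an untracked one. A failed run can therefore be retried, and the remaining lease still expires on its own TTL.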
  3. IAM Role/User Permissions for Vault:

    • Diagnosis: The AWS IAM user or role whose credentials Vault uses to manage access keys may lack the permissions needed to delete them. When that happens, revocation fails on the AWS side and the key stays valid. Check Vault’s server and audit logs for AccessDenied errors returned by AWS API calls.
    • Fix: Ensure the IAM user or role configured for Vault’s AWS secrets engine has iam:DeleteAccessKey, iam:UpdateAccessKey, and iam:ListAccessKeys (if applicable for tracking) on the IAM users that Vault manages. The policy below covers access-key operations only. For iam_user roles, where Vault creates a dedicated IAM user per credential, Vault also needs user-management actions such as iam:CreateUser and iam:DeleteUser.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "iam:CreateAccessKey",
                      "iam:DeleteAccessKey",
                      "iam:UpdateAccessKey",
                      "iam:ListAccessKeys",
                      "iam:GetAccessKeyLastUsed"
                  ],
                  "Resource": "arn:aws:iam::123456789012:user/vault-managed-user-*"
              }
          ]
      }
      
    • Why it works: For iam_user roles, Vault’s AWS secrets engine creates IAM users and their access keys, and deletes them when the lease is revoked. If Vault cannot delete or update those access keys, revocation never takes effect in AWS.
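Before applying a policy, it can help to check it against the actions your revocation path depends on. The sketch below is deliberately simplified: it only inspects Allow statements and their Action lists (with * and ? wildcards) and ignores Resource, Condition, NotAction, and explicit Deny. The function name and the list of required actions are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def missing_actions(policy, required):
    """Return the actions in `required` that no Allow statement in `policy` grants."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM also accepts a single statement object instead of a list.
        statements = [statements]

    allowed = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        allowed.extend([actions] if isinstance(actions, str) else actions)

    # IAM action names are case-insensitive, so compare in lower case.
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern.lower()) for pattern in allowed)
    ]


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["iam:CreateAccessKey", "iam:DeleteAccessKey", "iam:ListAccessKeys"],
            "Resource": "arn:aws:iam::123456789012:user/vault-managed-user-*",
        }
    ],
}
print(missing_actions(policy, ["iam:DeleteAccessKey", "iam:DeleteUser"]))
```

An empty result only means the actions are named in the policy. It does not prove the Resource pattern matches the users Vault actually creates.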
  4. Lease Renewal Loop Without Revocation:

    • Diagnosis: The rotation process might renew the existing credential’s lease instead of generating a new credential and revoking the old one. This is common when the logic for detecting "stale" credentials is flawed. Check Vault’s audit logs for requests to sys/leases/renew that are never followed by a request to sys/leases/revoke for the same lease ID.
    • Fix: Modify your rotation logic. Instead of renewing a lease, always generate a new access key pair, then revoke the previous access key pair using its lease ID as soon as consumers have switched over.
    • Why it works: This forces a complete lifecycle: create new, use new, revoke old. Renewal keeps the same key active for as long as the lease can be extended, up to the maximum TTL, which defeats the purpose of rotation.
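The audit-log check described in the diagnosis can be roughly automated. The sketch below assumes a JSON-lines file audit device where each request entry carries `type`, `request.path`, and `request.data.lease_id`. That layout is an assumption to verify against your own logs. Lease IDs are normally HMAC-hashed in audit output, which still allows matching renewals to revocations within one audit device. Revocations that put the lease ID in the URL path, or that use prefix revocation, are not detected here.

```python
import json


def renewed_but_never_revoked(audit_lines):
    """Return lease IDs that appear in renew requests but in no revoke request."""
    renewed, revoked = set(), set()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "request":
            # Each call is logged as a request and a response; count it once.
            continue
        request = entry.get("request") or {}
        path = request.get("path", "")
        lease_id = (request.get("data") or {}).get("lease_id")
        if lease_id is None:
            continue
        if path.startswith("sys/leases/renew"):
            renewed.add(lease_id)
        elif path.startswith("sys/leases/revoke"):
            revoked.add(lease_id)
    return sorted(renewed - revoked)


# Usage (the log path is hypothetical):
#   with open("/var/log/vault/audit.log") as fh:
#       print(renewed_but_never_revoked(fh))
```

A lease that shows up here repeatedly across rotation cycles is a strong hint that the script is renewing instead of replacing.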
  5. Vault Server Clock Skew:

    • Diagnosis: Significant time differences between Vault servers (if clustered) or between Vault and the target system (e.g., AWS) can cause lease expirations and revocations to be evaluated incorrectly. Check Vault server logs for time-related warnings.
    • Fix: Ensure all Vault servers are synchronized using NTP. Verify that Vault’s system time is accurate.
      sudo timedatectl
      
    • Why it works: Lease durations are time-based. If Vault thinks it’s earlier or later than it actually is, its lease management will be unreliable, potentially leaving leases active longer than intended or revoking them prematurely.
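A quick way to estimate skew between a client host and a Vault server is to compare the local clock with the Date header of any Vault HTTP response. This is a rough sketch, assuming the response carries a standard HTTP Date header (one-second resolution) and that network latency is small. The function name and the 5-second tolerance in the usage note are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(date_header, local_now=None):
    """Return local time minus the server time in an HTTP Date header, in seconds.

    A positive value means the local clock is ahead of the server.
    """
    remote = parsedate_to_datetime(date_header)
    local = local_now if local_now is not None else datetime.now(timezone.utc)
    return (local - remote).total_seconds()


# Usage (requires network access to Vault; the address is an example):
#   import urllib.request
#   with urllib.request.urlopen("http://127.0.0.1:8200/v1/sys/health") as resp:
#       skew = clock_skew_seconds(resp.headers["Date"])
#   if abs(skew) > 5:
#       print(f"Clock skew of {skew:.0f}s detected; check NTP on both hosts")
```

This only compares two hosts. It does not tell you which of them is wrong, so NTP status on each server remains the authoritative check.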
  6. Incorrectly Configured default_lease_ttl on the Secrets Engine Mount:

    • Diagnosis: The default_lease_ttl on the AWS secrets engine mount sets the TTL of generated credentials whenever nothing more specific applies. If it is unset, the mount inherits Vault’s system default of 768h (32 days).
      vault read sys/mounts/aws/tune
      
      Look for default_lease_ttl and max_lease_ttl.
    • Fix: Set default_lease_ttl on the mount to a short value (e.g., 15m) and max_lease_ttl to a hard ceiling (e.g., 1h). Lease renewals cannot extend a credential past max_lease_ttl, so keep it only as long as the longest-lived ML credential needs.
      vault secrets tune -default-lease-ttl=15m -max-lease-ttl=1h aws/
      
    • Why it works: The mount’s default_lease_ttl is the fallback when no other TTL is set, and max_lease_ttl bounds how long a credential can stay valid through renewals. If either is too high, it can mask rotation problems, because old credentials stay valid far longer than the rotation interval.
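The fallback behavior can be modeled in a few lines. This is a simplified sketch of how a requested TTL, the mount default, and the mount maximum interact, not Vault's actual implementation. It assumes the system default and system maximum are both 768h unless the mount overrides them, and the function names are illustrative.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value):
    """Convert a duration such as '15m', '1h' or '1h30m' to seconds."""
    if not re.fullmatch(r"(\d+[smh])+", value):
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([smh])", value))


SYSTEM_DEFAULT_TTL = parse_duration("768h")  # assumed system default and maximum


def effective_ttl(requested=None, mount_default=None, mount_max=None):
    """Return the TTL in seconds that a new lease would receive under this model."""
    ceiling = parse_duration(mount_max) if mount_max else SYSTEM_DEFAULT_TTL
    if requested:
        ttl = parse_duration(requested)
    elif mount_default:
        ttl = parse_duration(mount_default)
    else:
        ttl = SYSTEM_DEFAULT_TTL
    # The ceiling always wins over whatever TTL was chosen above.
    return min(ttl, ceiling)


print(effective_ttl(mount_default="15m", mount_max="1h"))
print(effective_ttl(requested="2h", mount_default="15m", mount_max="1h"))
print(effective_ttl())
```

The last call shows why an untuned mount is risky: with nothing configured, the model falls all the way back to 32 days.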

The next error you’ll likely hit is a permission denied error when Vault attempts to perform an action on AWS, indicating a misconfiguration in the IAM policies attached to the role Vault is assuming.
