Vault’s ML Credentials feature, designed to securely manage API keys for machine learning models, is leaking them because the credential rotation mechanism is failing to properly revoke old credentials upon generating new ones.
Common Causes and Fixes:
- Incorrect `lease_duration` Configuration:
  - Diagnosis: Check the `lease_duration` that Vault attaches to credentials issued for the role. This is how long a credential stays valid before it must be renewed or it expires. If it’s too long, or if rotation fails to shorten it, old credentials can persist. The commands here assume the AWS secrets engine is mounted at `aws/`.
    ```bash
    vault read aws/roles/my-ml-role
    vault read aws/config/lease
    ```
    Look for `lease` and `lease_max`; they surface on each issued credential as `lease_duration` and `renewable`.
  - Fix: Keep the lease reasonably short (e.g., `30m` for 30 minutes). For `iam_user` credentials the lease is configured on the engine rather than on the role. If rotation is supposed to shorten leases, ensure your rotation script requests a new credential under the shorter lease instead of renewing the old one.
    ```bash
    vault write aws/roles/my-ml-role \
        credential_type=iam_user \
        policy_arns=arn:aws:iam::123456789012:policy/MyMLPolicy
    vault write aws/config/lease \
        lease=30m \
        lease_max=1h
    ```
  - Why it works: A shorter `lease_duration` inherently limits the exposure window of any leaked credential. A renewable lease allows Vault to extend it, but a correctly implemented rotation generates a new credential with a new lease and then revokes the old one, rather than just renewing the old one.
- Improper Revocation in Custom Rotation Scripts:
  - Diagnosis: If you’re using a custom script to rotate credentials (e.g., one that calls `vault read aws/creds/my-ml-role`), inspect the script for explicit revocation calls. Are you calling `vault lease revoke` or `vault token revoke` with the correct lease ID or token ID of the old credential?
  - Fix: After generating a new credential, explicitly revoke the old one using its lease ID. The response that carries the new credential also carries its lease ID, but you need to track and revoke the previous one.
    ```python
    # Example snippet in a Python rotation script
    import hvac

    client = hvac.Client(url='http://127.0.0.1:8200', token='your-vault-token')

    # Assume 'old_lease_id' holds the lease ID of the credential to be revoked.
    # Run this only after the new credential has been generated and deployed.
    try:
        client.sys.revoke_lease(lease_id=old_lease_id)
        print(f"Successfully revoked old credential with lease ID: {old_lease_id}")
    except Exception as e:
        print(f"Error revoking old credential: {e}")
    ```
  - Why it works: Vault doesn’t revoke old credentials when new ones are generated for the same role; they stay valid until their lease expires. You must explicitly tell Vault to revoke them. Failure to do so leaves the old, potentially leaked, credentials active.
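The ordering described above (create new, then revoke old) can be sketched without a running Vault. This is a minimal illustration, not a definitive implementation: `FakeVault`, `generate`, `revoke`, and `rotate` are hypothetical names invented for the example. In a real script the likely counterparts would be hvac's `client.secrets.aws.generate_credentials(name=...)` and `client.sys.revoke_lease(lease_id=...)`.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class FakeVault:
    """Hypothetical in-memory stand-in for a Vault client (illustration only)."""
    active: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def generate(self, role: str) -> str:
        # Issue a new credential and return its lease ID.
        lease_id = f"aws/creds/{role}/{next(self._ids)}"
        self.active.add(lease_id)
        return lease_id

    def revoke(self, lease_id: str) -> None:
        # Revoking an unknown lease raises, much like a failed API call would.
        if lease_id not in self.active:
            raise KeyError(lease_id)
        self.active.remove(lease_id)


def rotate(vault: FakeVault, role: str, old_lease_id: Optional[str]) -> str:
    """Generate a new credential first, then revoke the previous one."""
    new_lease_id = vault.generate(role)
    # A real script would deploy the new credential to its consumers here.
    if old_lease_id is not None:
        vault.revoke(old_lease_id)
    return new_lease_id


vault = FakeVault()
first = rotate(vault, "my-ml-role", None)
second = rotate(vault, "my-ml-role", first)
print(vault.active)  # {'aws/creds/my-ml-role/2'}
```

Revoking only after the new credential exists avoids a window with no valid credential; the cost is that the rotation script has to persist the previous lease ID between runs.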
- IAM Role/User Permissions for Vault:
  - Diagnosis: The AWS IAM role or user that Vault uses to generate AWS credentials might lack the permissions needed to delete or deactivate the access keys it issued, so revocation fails on the AWS side even though Vault attempted it. This matters when your rotation strategy has Vault manage IAM users or access keys directly. Check Vault’s server and audit logs for permission denied errors related to AWS API calls.
  - Fix: Ensure the IAM role/user whose credentials Vault is configured with has `iam:DeleteAccessKey`, `iam:UpdateAccessKey`, and `iam:ListAccessKeys` (if applicable for tracking) permissions for the relevant IAM users/roles that Vault manages.
    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "iam:CreateAccessKey",
            "iam:DeleteAccessKey",
            "iam:UpdateAccessKey",
            "iam:ListAccessKeys",
            "iam:GetAccessKeyLastUsed"
          ],
          "Resource": "arn:aws:iam::123456789012:user/vault-managed-user-*"
        }
      ]
    }
    ```
  - Why it works: Vault’s AWS secrets engine often works by creating and managing IAM users or their access keys. If Vault cannot delete or update access keys for the IAM users it manages, it cannot revoke credentials effectively.
- Lease Renewal Loop Without Revocation:
  - Diagnosis: The rotation process might be hitting a condition where it renews the existing credential’s lease instead of generating a new one and revoking the old. This is common if the logic for detecting "stale" credentials is flawed. Check Vault’s audit logs for requests to `sys/leases/renew` that aren’t followed by a request to `sys/leases/revoke` for the same lease ID.
  - Fix: Modify your rotation logic. Instead of just renewing a lease, always generate a new access key pair. Then, immediately revoke the previous access key pair using its lease ID.
  - Why it works: This forces a complete lifecycle: create new, use new, revoke old. A simple renewal keeps the same key active for as long as renewals are allowed, defeating the purpose of rotation.
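The audit-log check described above can be sketched as a small scan for lease IDs that were renewed but never revoked. The field layout is an assumption to verify against your own logs: the sketch assumes one JSON object per line, with the request path in `request.path` and the lease ID in `request.data.lease_id`. Audit devices typically HMAC string values, so the IDs may appear hashed; matching should still work as long as the same lease ID always hashes to the same value. The function name and sample entries are made up for illustration.

```python
import json


def renewed_but_never_revoked(audit_lines):
    """Return lease IDs seen in renew requests but in no revoke request."""
    renewed, revoked = set(), set()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only look at request entries; responses would double-count.
        if entry.get("type") != "request":
            continue
        request = entry.get("request") or {}
        lease_id = (request.get("data") or {}).get("lease_id")
        if lease_id is None:
            continue
        path = request.get("path", "")
        if path.startswith("sys/leases/renew"):
            renewed.add(lease_id)
        elif path.startswith("sys/leases/revoke"):
            revoked.add(lease_id)
    return renewed - revoked


# Hypothetical audit entries, trimmed to the fields the scan reads.
sample = [
    '{"type": "request", "request": {"path": "sys/leases/renew", "data": {"lease_id": "lease-a"}}}',
    '{"type": "response", "request": {"path": "sys/leases/renew", "data": {"lease_id": "lease-a"}}}',
    '{"type": "request", "request": {"path": "sys/leases/renew", "data": {"lease_id": "lease-b"}}}',
    '{"type": "request", "request": {"path": "sys/leases/revoke", "data": {"lease_id": "lease-b"}}}',
    '{"type": "request", "request": {"path": "sys/leases/renew", "data": {"lease_id": "lease-a"}}}',
]
print(sorted(renewed_but_never_revoked(sample)))  # ['lease-a']
```

A lease that shows up here is not necessarily a leak, since it may simply expire on its own, but a lease that is renewed repeatedly and never revoked is the pattern this item describes.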
- Vault Server Clock Skew:
  - Diagnosis: Significant time differences between Vault servers (if clustered) or between Vault and the target system (e.g., AWS) can cause lease expirations and revocations to be evaluated incorrectly. Check Vault server logs for time-related warnings.
  - Fix: Ensure all Vault servers are synchronized using NTP. Verify that Vault’s system time is accurate.
    ```bash
    sudo timedatectl
    ```
  - Why it works: Lease durations are time-based. If Vault thinks it’s earlier or later than it actually is, its lease management will be unreliable, potentially leaving leases active longer than intended or revoking them prematurely.
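To make the effect of skew concrete, here is a small worked calculation under a deliberately simplified model: the lease is issued against the true time, and expiry is judged by a node whose clock is offset by `skew`. The function name and the model itself are assumptions for illustration, not a description of Vault’s internals.

```python
from datetime import timedelta


def actual_lifetime(ttl: timedelta, skew: timedelta) -> timedelta:
    """Real-world lifetime of a lease whose expiry is judged by a skewed clock.

    skew > 0 means the evaluating clock runs ahead of true time (early expiry);
    skew < 0 means it runs behind (the lease outlives its intended TTL).
    """
    return max(ttl - skew, timedelta(0))


ttl = timedelta(minutes=30)
print(actual_lifetime(ttl, timedelta(minutes=-10)))  # 0:40:00
print(actual_lifetime(ttl, timedelta(minutes=10)))   # 0:20:00
```

With a 30-minute lease, a clock running 10 minutes behind keeps the credential usable for 40 minutes, which is the "active longer than intended" case.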
- Incorrectly Configured `default_lease_ttl` on the AWS Mount:
  - Diagnosis: The `default_lease_ttl` on the AWS mount itself can influence the TTL of generated credentials if it is not overridden by a more specific `lease_duration` setting. The commands here assume the mount path is `aws/`.
    ```bash
    vault read sys/mounts/aws/tune
    ```
    Look for `default_lease_ttl` and `max_lease_ttl`.
  - Fix: Set the `default_lease_ttl` on the mount to a sensible value, but ensure the more specific `lease_duration` setting is always the primary control for ML credential TTL. A common pattern is to set a short `default_lease_ttl` on the mount (e.g., `15m`) and then a specific, potentially longer but still controlled, `lease_duration` where needed.
    ```bash
    vault secrets tune \
        -default-lease-ttl=15m \
        -max-lease-ttl=1h \
        aws/
    ```
  - Why it works: The specific `lease_duration` takes precedence, but the mount’s `default_lease_ttl` acts as a fallback when nothing more specific is set, and `max_lease_ttl` caps how long a credential can be valid. If these are too high, they can mask issues with role-specific TTLs.
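The fallback-and-cap relationship described above can be modeled in a few lines. This is a simplified sketch of the precedence, not Vault’s exact resolution logic; `effective_ttl` and its parameter names are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def effective_ttl(
    specific_ttl: Optional[timedelta],
    default_ttl: timedelta,
    max_ttl: timedelta,
) -> timedelta:
    """Use the specific TTL if set, else the mount default; cap at the max."""
    requested = specific_ttl if specific_ttl is not None else default_ttl
    return min(requested, max_ttl)


default_ttl = timedelta(minutes=15)
max_ttl = timedelta(hours=1)

print(effective_ttl(None, default_ttl, max_ttl))                   # 0:15:00
print(effective_ttl(timedelta(minutes=30), default_ttl, max_ttl))  # 0:30:00
print(effective_ttl(timedelta(hours=2), default_ttl, max_ttl))     # 1:00:00
```

The third case shows why an overly generous maximum matters: it is the only thing standing between a misconfigured specific TTL and a long-lived credential.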
The next error you’ll likely hit is a permission denied error when Vault attempts to perform an action on AWS, indicating a misconfiguration in the IAM policies attached to the role Vault is assuming.