Isolate ML Workloads with Network Security (2026)

It’s surprisingly easy to accidentally expose your ML training or inference workloads to the entire internet, even if you think they’re locked down.

Let’s look at a typical setup: you’ve got a Kubernetes cluster running your ML jobs. Your ML models are exposed via a service, say, a LoadBalancer type, and you assume the cloud provider’s network security handles the rest. But what if your ML workloads need to pull data from a separate S3 bucket, or push results to another service? If those egress connections aren’t carefully controlled, your ML instances could become a pivot point for attackers.

Here’s how a common scenario might look. You’re running training jobs on EKS, and they need to access data stored in an S3 bucket.

apiVersion: v1
kind: Pod
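metadata:
  name: ml-training-job
  namespace: ml-workloads
  labels:
    app: ml-training # Matched by the NetworkPolicy podSelector later in this article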
spec:
  containers:
  - name: training-container
    image: my-ml-trainer:latest
    resources:
      limits:
        cpu: "4"
        memory: "16Gi"
    env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws_access_key_id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws_secret_access_key

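This pod has long-lived credentials to access AWS. If the pod’s network egress isn’t restricted, it can reach any AWS service endpoint, or any other host on the internet. An attacker who compromises this pod could use those credentials to read your S3 bucket or, if the IAM policy behind the keys is broad enough, spin up expensive EC2 instances in your account. Static keys work from anywhere, so unrestricted egress also makes it trivial to copy them out of the cluster.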

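The core problem is that network access in cloud-native environments is often too broad by default. In Kubernetes, a pod can send and receive traffic freely until a Network Policy selects it, and policies are only enforced when the cluster’s CNI (Container Network Interface) plugin supports them. Cloud provider security groups are powerful too, but they also require explicit configuration to enforce least privilege for your ML workloads.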

The solution is to combine Kubernetes Network Policies with cloud provider firewall rules to create a defense-in-depth strategy.

First, let’s restrict egress traffic from your ML pods. You can use Kubernetes Network Policies to define which pods can communicate with which other pods and external endpoints.

Here’s a Network Policy that limits egress to an explicit allowlist, starting with the CIDR range of your S3 endpoint:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-ml-egress
  namespace: ml-workloads
spec:
  podSelector:
    matchLabels:
      app: ml-training
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 52.216.0.0/16 # Example CIDR for AWS S3 us-east-1
    ports:
    - protocol: TCP
      port: 443
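  - to:
    - podSelector: {} # Allow egress to other pods within the same namespace
  - to: # Allow DNS lookups; without this rule the pod cannot resolve the S3 hostname
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns # Label used by CoreDNS on EKS; verify in your cluster
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

This policy, applied to pods with the label app: ml-training in the ml-workloads namespace, permits TCP traffic on port 443 to the S3 IP range, traffic to other pods in the same namespace, and DNS queries to the cluster DNS pods. All other outbound traffic from those pods is dropped. You’ll need to find the correct CIDRs for the AWS region your S3 bucket resides in. AWS publishes these ranges in its ip-ranges.json file, and they change, so you’ll need a mechanism to keep them updated, or rely on cloud-specific integrations that map service endpoints to IPs.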
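
The policy above only isolates pods labeled app: ml-training; any other pod in the namespace remains unrestricted. A common companion is a namespace-wide default-deny policy. The following is a minimal sketch, assuming every workload in ml-workloads will get its own allow rules (the name default-deny-all is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all # Illustrative name
  namespace: ml-workloads
spec:
  podSelector: {} # Empty selector: applies to every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules are listed, so all traffic (including DNS)
  # is denied until another policy allows it.
```

Network Policies are additive, so restrict-ml-egress still opens its listed destinations on top of this baseline.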

Next, you need to ensure your Kubernetes nodes themselves are protected. Cloud provider security groups are your first line of defense. If your ML pods are running on EC2 instances (as is common with EKS), you should configure the security group attached to those instances.

For example, if your ML instances have a security group named ml-node-sg, you’d modify it to:

Allow Ingress (only for your ML service): If your ML model is served via a LoadBalancer service, the security group should only allow ingress on the service port (e.g., 80 or 443) from trusted IP ranges (e.g., your internal network, specific API gateways).
Allow Egress (only for necessary services): AWS security groups allow all outbound traffic by default, so remove that rule and add specific rules that allow egress only to essential endpoints. This includes the S3 CIDR range mentioned above, any internal API endpoints, necessary external services, and whatever the nodes themselves need to stay healthy, such as the cluster API server and your image registry. Do not allow 0.0.0.0/0.
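
For the ingress side, Kubernetes can also express the trusted ranges on the Service itself. The sketch below assumes a Service named ml-inference-lb (hypothetical) in front of pods labeled app: ml-inference that listen on port 8080, and an internal range of 10.0.0.0/8. On AWS, the load balancer integration generally turns loadBalancerSourceRanges into security group rules, but check how your controller handles the field.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-lb # Hypothetical name
  namespace: ml-workloads
spec:
  type: LoadBalancer
  selector:
    app: ml-inference
  ports:
  - protocol: TCP
    port: 443        # Port exposed by the load balancer
    targetPort: 8080 # Assumed container port
  loadBalancerSourceRanges:
  - 10.0.0.0/8 # Assumed internal range; replace with your trusted CIDRs
```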

A common mistake is to allow all egress from the EC2 instance’s security group because it’s "easier" during development. That removes the host-level layer entirely, so a single missing or misconfigured Network Policy leaves a pod with unrestricted outbound access. The security group acts as the host-level firewall, and Network Policies act as the in-cluster firewall, providing layered security.
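
One way to keep these rules reviewable is to declare the security group as code. The following CloudFormation fragment is a sketch rather than a complete node group definition: the parameter names, the resource name MlNodeSecurityGroup, and the reuse of the example S3 range are assumptions, and a working EKS node needs additional egress rules.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a node security group with allowlisted egress
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  LoadBalancerSecurityGroupId: # Assumed: the group attached to your load balancer
    Type: AWS::EC2::SecurityGroup::Id
Resources:
  MlNodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: ml-node-sg
      GroupDescription: Worker nodes for ML workloads
      VpcId: !Ref VpcId
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 443
        ToPort: 443
        SourceSecurityGroupId: !Ref LoadBalancerSecurityGroupId
      SecurityGroupEgress:
      # Listing explicit egress rules replaces the default allow-all rule
      - IpProtocol: tcp
        FromPort: 443
        ToPort: 443
        CidrIp: 52.216.0.0/16 # Example S3 range from the Network Policy above
        Description: HTTPS to S3
      # A real EKS node group needs more rules (cluster API, image registry, DNS)
```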

Consider your ML inference endpoints. If they are exposed via an Ingress controller, the Ingress controller’s security group needs to be restricted. The pods behind the Ingress controller should also have Network Policies applied.

Here’s an Ingress resource example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-inference-ingress
  namespace: ml-workloads
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "192.168.1.0/24" # Only allow access from internal network
spec:
  ingressClassName: nginx # Assumes the controller's IngressClass is named nginx
  rules:
  - host: inference.example.com
    http:
      paths:
      - path: /predict
        pathType: Prefix
        backend:
          service:
            name: ml-inference-service
            port:
              number: 8080

And a corresponding Network Policy for the inference pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-ingress-controller
  namespace: ml-workloads
spec:
  podSelector:
    matchLabels:
      app: ml-inference
  policyTypes:
  - Ingress
  ingress:
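  - from:
    - namespaceSelector: # Without this, the podSelector only matches pods in ml-workloads
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx # Assuming the controller runs in this namespace
      podSelector:
        matchLabels:
          app: ingress-nginx # Assuming your ingress controller pods have this label
    ports:
    - protocol: TCP
      port: 8080 # Assumes the pods listen on the port ml-inference-service exposes

This policy allows traffic only from the Nginx Ingress controller pods to your ML inference pods. A podSelector on its own matches only pods in the policy’s namespace, which is why the namespaceSelector is needed when the controller runs in a different one. The IP whitelisting on the Ingress resource itself adds another layer.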

The most surprising detail is how easily a compromised pod can become a gateway to your entire cloud environment if egress isn’t strictly controlled. Many developers assume that if a pod can’t be reached from the outside, it’s safe. But the outbound connections are often the overlooked attack vector.

Once you’ve locked down egress to S3, you might find your ML jobs failing because they also need to reach a downstream API for feature enrichment.
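
Because Network Policies are additive, that kind of failure is usually better fixed with a second, narrowly scoped policy than by loosening the first. The sketch below assumes the enrichment API runs in the same cluster, in a namespace called feature-store, on pods labeled app: feature-api that listen on port 8443; all three are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ml-egress-to-feature-api # Hypothetical name
  namespace: ml-workloads
spec:
  podSelector:
    matchLabels:
      app: ml-training
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: feature-store # Hypothetical namespace
      podSelector:
        matchLabels:
          app: feature-api # Hypothetical label
    ports:
    - protocol: TCP
      port: 8443 # Hypothetical port
```

If the feature-store namespace enforces its own ingress policies, it also needs a rule that admits traffic from ml-workloads.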