Run AI Inference at the Edge with Cloudflare Workers AI (2026)

Cloudflare Workers AI lets you run machine learning models directly on Cloudflare’s edge network, bringing inference closer to your users and reducing latency.

Let’s see it in action. Imagine you want to classify images uploaded by your users. You can use a pre-trained image classification model such as @cf/microsoft/resnet-50 to perform this task.

Here’s a simplified example of a Worker script that takes an image and sends it to the model for classification:

export default {
  async fetch(request, env, ctx) {
    if (request.method === 'POST') {
      // Read the raw image bytes from the request body.
      const bytes = new Uint8Array(await request.arrayBuffer());

      // The image classification model takes the image as an array of byte values.
      const response = await env.AI.run(
        '@cf/microsoft/resnet-50',
        {
          image: [...bytes],
        }
      );

      // The result is a list of predictions, each with a label and a score.
      return new Response(JSON.stringify(response), {
        headers: { 'content-type': 'application/json' },
      });
    }

    // Any other method is rejected; the Allow header tells clients what to use.
    return new Response('Send a POST request with an image to classify.', {
      status: 405,
      headers: { Allow: 'POST' },
    });
  },
};

This script, when deployed as a Cloudflare Worker, listens for POST requests and expects the raw image bytes in the request body. It converts those bytes to an array and passes them to the AI.run method. This method is the core of Workers AI: it invokes a model from Cloudflare’s catalog through the AI binding, so env.AI is available only if the Worker’s Wrangler configuration declares an AI binding named AI. The result, a list of labels with confidence scores, is then returned as a JSON response. Any other HTTP method receives a 405 response.
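Once the JSON comes back, a caller usually wants the single most likely label. The following is a minimal sketch, assuming the response is an array of { label, score } objects, which is the shape image classification models on Workers AI are documented to return. The helper name topPrediction and the sample scores are hypothetical and not part of any Cloudflare API.

```javascript
// Hypothetical helper: pick the highest-scoring prediction.
// Assumes `predictions` is an array of { label, score } objects.
function topPrediction(predictions) {
  if (!Array.isArray(predictions) || predictions.length === 0) {
    return null; // nothing to choose from
  }
  return predictions.reduce((best, p) => (p.score > best.score ? p : best));
}

// Example with made-up scores (not real model output):
const sample = [
  { label: 'TIGER CAT', score: 0.31 },
  { label: 'TABBY', score: 0.42 },
  { label: 'LYNX', score: 0.04 },
];
console.log(topPrediction(sample).label);
```

Returning null for an empty or malformed response keeps the caller in charge of deciding what "no prediction" means for the application.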

The power of Workers AI lies in its distributed nature. When a user makes a request to your Worker, it’s routed to a nearby Cloudflare data center, and inference runs on GPUs inside Cloudflare’s network rather than in a single distant cloud region. Not every Cloudflare location hosts GPUs, so the model may run in a nearby location rather than the exact one that received the request. Even so, this typically shortens the round trip, which matters for real-time applications, interactive experiences, and any scenario where low latency is paramount. You’re not just running code at the edge; you’re running complex ML models there too.

The primary problem Workers AI solves is the latency and cost associated with sending data to a centralized cloud for inference and then back. For many applications, especially those dealing with real-time video, audio, or image processing, this round trip can be prohibitively slow. Moving inference closer to where the data is generated removes much of that delay. It can also be more cost-effective than provisioning and managing dedicated GPU instances in the cloud, especially for bursty or unpredictable workloads, because you pay for what you use rather than for idle capacity.

Internally, Cloudflare manages the deployment and scaling of these AI models across its global network. When you specify a model ID such as @cf/microsoft/resnet-50 in your AI.run call, Cloudflare routes that request to hardware that has the model available. The model itself is optimized to run efficiently on that hardware. You don’t need to worry about provisioning GPUs, managing container orchestration, or dealing with the complexities of distributed ML inference. Cloudflare handles all of that infrastructure heavy lifting.
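Because the Worker only sees the model through the env.AI binding, the handler logic can be exercised without any GPU by passing in a stand-in object. The following is a minimal sketch for local testing. The names classify and fakeEnv and the canned result are hypothetical, and the stub only mimics the run(model, inputs) call shape; it does not reproduce real model behavior.

```javascript
// Hypothetical handler logic, separated from the Worker's fetch() for testing.
async function classify(env, bytes) {
  return env.AI.run('@cf/microsoft/resnet-50', { image: [...bytes] });
}

// Stand-in for the real binding: records each call and returns canned data.
const calls = [];
const fakeEnv = {
  AI: {
    async run(model, inputs) {
      calls.push({ model, inputs });
      return [{ label: 'EXAMPLE', score: 1 }]; // made-up result
    },
  },
};

// Exercise the handler logic with three arbitrary bytes.
const pending = classify(fakeEnv, new Uint8Array([1, 2, 3]));
pending.then((result) => console.log(result[0].label, calls[0].inputs.image));
```

Keeping the model call behind a small function like this means the same code runs unchanged against the real binding in production.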

You control the inference process through the AI.run method. The first argument is the model ID, which Cloudflare publishes for each model in its catalog. The second argument is an object containing the inputs that the specific model expects. For text-generation models, the main input is the prompt text. For image models, you pass the image data itself, for example as an array of byte values. For more control over specific use cases, Workers AI also supports LoRA adapters on some text-generation models, which lets you apply a fine-tune trained elsewhere; check the model catalog for which models support this.
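To make the difference between input shapes concrete, here is a minimal sketch of a helper that builds the second argument to AI.run for two task types. The function name buildInputs and the task strings are hypothetical; the prompt and image field names follow the inputs described above, and other models may expect different or additional fields.

```javascript
// Hypothetical helper: build the inputs object for two kinds of model.
function buildInputs(task, payload) {
  switch (task) {
    case 'text-generation':
      // Text models take the input text as `prompt`.
      return { prompt: String(payload) };
    case 'image-classification':
      // Image models take raw bytes as a plain array of numbers.
      return { image: [...new Uint8Array(payload)] };
    default:
      throw new Error(`Unsupported task: ${task}`);
  }
}

const textInputs = buildInputs('text-generation', 'Summarize edge computing in one sentence.');
const imageInputs = buildInputs('image-classification', new Uint8Array([255, 216, 255]).buffer);
console.log(textInputs, imageInputs);

// Inside a Worker, the result would be passed straight through, for example:
//   await env.AI.run('@cf/meta/llama-2-7b-chat-fp16', textInputs);
```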

One of the most overlooked aspects of edge AI inference is model quantization. Models in the Workers AI catalog are published at a specific precision, which is often visible in the model ID, such as the -fp16 suffix in @cf/meta/llama-2-7b-chat-fp16. Understanding how quantization affects accuracy is key to balancing performance and precision. For instance, weights stored at 4-bit precision take roughly a quarter of the memory of their FP16 counterparts and usually run faster, but some accuracy loss is likely, and how much depends on the model and the task. Because the precision is fixed per catalog entry, the practical way to experiment is to run the available variants of a model against a sample of your own inputs and pick the one that best matches your application’s tolerance for error versus its need for speed.
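The size difference follows from simple arithmetic: weight storage is the parameter count times the bits per weight, divided by eight. The sketch below assumes a model with about 7 billion parameters and counts weights only, ignoring activations, caches, and the extra metadata that quantization formats store. The function name weightBytes is hypothetical.

```javascript
// Approximate bytes needed to store model weights alone.
// Ignores activations, KV cache, and quantization metadata such as scales.
function weightBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

const GB = 1e9; // decimal gigabytes
const params = 7e9; // assumption: a "7B" model has about 7 billion parameters

for (const bits of [16, 8, 4]) {
  console.log(`${bits}-bit: ${(weightBytes(params, bits) / GB).toFixed(1)} GB`);
}
// prints 14.0 GB, 7.0 GB, and 3.5 GB for 16, 8, and 4 bits
```

These figures explain why lower-precision variants are cheaper to host, but they say nothing about accuracy, which still has to be measured on your own data.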

The next step after successfully running inference at the edge is to explore fine-tuning models with your own data for custom tasks.