NGINX Gateway Fabric Supports the Gateway API Inference Extension 

Running inference at scale introduces complexities that ordinary routing can’t resolve. LLMs and other generative AI workloads differ fundamentally from standard web services. With NGINX Gateway Fabric (NGF) version 2.2, organizations can now tap into the Gateway API Inference Extension to enable smart, inference-aware routing in Kubernetes. Platform and ML teams can publish self-hosted GenAI and LLM inference workloads with smarter routing decisions and finer control over GPU and compute resource usage.

The Gateway API Inference Extension is a community-driven Kubernetes project that standardizes routing logic for inference workloads across the ecosystem. NGF 2.2 integrates with that extension, allowing NGINX to make routing decisions based on AI workload and model characteristics rather than generic traffic heuristics.

The Challenge: Traditional Gateways Are Insufficient for LLMs

NGINX already excels in microservice traffic management. But inference workloads come with new demands:

  • A single request can monopolize expensive GPU resources
  • Inference requests may take seconds (or more), not milliseconds
  • Request payloads vary in size and resource intensity
  • Model-serving behavior is stateful and dynamic: queue lengths shift, KV-cache usage fluctuates, and different adapters (e.g. LoRA) may or may not be loaded

In such an environment, routing decisions become critical. Sending a request to an already saturated node can stall an entire GPU, degrade throughput, and introduce latency. Popular NGINX load balancing algorithms like round robin or least_time are blind to model state and cannot optimize for inference. What is required instead is routing built for AI workloads that can respond to runtime signals and make intelligent decisions.

How the Gateway API Inference Extension Works

Rather than invent a new routing paradigm, the extension builds on the existing Gateway API model (Gateway, HTTPRoute, backendRefs) while adding inference-aware features.
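
As a quick reference point, a Gateway handled by NGINX Gateway Fabric can be sketched as follows. The name inference-gateway and the listener details are illustrative; gatewayClassName: nginx is what ties it to NGF.

```yaml
# Minimal sketch of a Gateway managed by NGINX Gateway Fabric.
# The name and listener are placeholders; inference routes attach to it via HTTPRoute.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: nginx
  listeners:
  - name: http
    port: 80
    protocol: HTTP
```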

Core Resource: InferencePool

The core CRD is InferencePool. This represents a set of model server pods (i.e. the inference backends). It also includes configuration indicating which Endpoint Picker (EPP) implementation to invoke for scheduling decisions.

When an HTTPRoute’s backendRef points to an InferencePool, gateways know to engage inference routing logic (rather than plain service routing).
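
As a rough sketch, an InferencePool looks something like the manifest below. This uses the extension’s v1alpha2-style schema, so the API version and field names may differ in the release you install; the pool and EPP names match the example later in this post, and the port and labels are assumptions.

```yaml
# Sketch of an InferencePool (v1alpha2-style schema; fields may differ across
# Inference Extension releases). It selects the model server pods and names the
# Endpoint Picker (EPP) that makes the scheduling decisions.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vitllama3-8b-instruct
spec:
  targetPortNumber: 8000              # port the model servers listen on (assumed)
  selector:
    app: vitllama3-8b-instruct        # labels of the model-serving pods (assumed)
  extensionRef:
    name: vitllama3-8b-instruct-epp   # EPP that picks the endpoint for each request
```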

Endpoint Picker (EPP)

The Endpoint Picker is a core component in the architecture of the Inference Extension. It acts as an intelligent scheduler. When a request arrives, the gateway sends metadata about candidate pods (e.g. queue lengths, KV cache usage, adapter presence) to the EPP, which then selects the “best” pod to handle that request.

Because the EPP is part of the community extension, various gateway implementations (including NGINX) can adopt it rather than invent new logic.

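The sketch below illustrates the route side of the example discussed next. The route name and catch-all path match are placeholders, and the backendRef points at the InferencePool sketched earlier (the group must match the InferencePool API version you have installed).

```yaml
# Sketch of an HTTPRoute whose backendRef is an InferencePool rather than a Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io   # matches the installed InferencePool API group
      kind: InferencePool
      name: vitllama3-8b-instruct
```
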
In the above example, the manifests show how NGINX Gateway Fabric routes inference traffic using the Gateway API Inference Extension. The HTTPRoute sends all incoming requests through the inference-gateway to an InferencePool called vitllama3-8b-instruct, which represents the model-serving pods. Rather than applying a simple load-balancing algorithm, NGF queries the Endpoint Picker (vitllama3-8b-instruct-epp) specified in the pool to choose the best pod based on real-time metrics. In short, this ensures that each request is directed to the model replica best suited to serve it for performance and resource efficiency.

How NGINX Gateway Fabric Implements Inference Routing

When a request arrives at NGF, it is matched via HTTPRoute as usual. But if the HTTPRoute’s backendRef points to an InferencePool, NGF handles routing differently:

  1. NGF passes the request metadata and candidate backends to the EPP via an external processing (ext_proc) mechanism
  2. The EPP chooses one endpoint (pod) from the InferencePool based on runtime metrics
  3. NGF routes the request to the selected pod

In this way, NGF becomes an Inference Gateway, enabling model-aware routing, rollout control, priority logic, and enhanced observability.

NGF does not ship a proprietary EPP; instead, it integrates the standard inference extension’s scheduling logic. This avoids reinvention and adheres to the community design.

NGF’s support for the inference extension is toggled via configuration (e.g. enabling the Gateway API Inference Extension in the Helm chart), along with installing the InferencePool CRDs and wiring up the EPP.
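
As an illustration, enabling the feature through Helm values might look like the snippet below. The exact values key is an assumption here, so consult the NGF 2.2 documentation for the authoritative setting.

```yaml
# Illustrative Helm values snippet. The key name below is assumed for the sake
# of the example; check the NGF 2.2 Helm chart docs for the exact setting.
nginxGateway:
  gwAPIInferenceExtension:
    enable: true
```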

One Gateway, Many Workloads

With inference support, NGINX Gateway Fabric serves as a unified gateway for APIs, microservices, and AI models. AI functionality is not siloed; LLMs coexist with traditional workloads under the same control plane.

Users can expose GenAI endpoints, manage traffic patterns (e.g. safe rollouts, priority models), and monitor inference metrics alongside their regular traffic observability stack.
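
For example, a conventional API route can attach to the same Gateway and target an ordinary Service; all names below are hypothetical.

```yaml
# Hypothetical route for ordinary API traffic on the same Gateway: the
# backendRef is a plain Service, not an InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-api
spec:
  parentRefs:
  - name: inference-gateway
  hostnames:
  - "api.example.com"
  rules:
  - backendRefs:
    - name: orders-svc
      port: 8080
```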

Why This Matters

The Gateway API Inference Extension introduces a community standard for AI routing. NGF 2.2 aligns with that standard by integrating it rather than reinventing it.

Platform teams can stay inside Kubernetes and avoid fragmented toolchains, while ML teams benefit from routing optimized for latency, GPU efficiency, and adaptive scheduling. Ultimately, AI workloads can run with the operational maturity that teams expect of their web services.

Want to start working with the Gateway API Inference Extension? Download NGINX Gateway Fabric and jump in today!