Module 5: The 4.21 Horizon — AI & Autonomic Operations

Duration: 15 minutes
Type: Presentation Only

Learning Objectives

By the end of this module, you will:

  • Understand how the KubeVirtRelieveAndMigrate descheduler profile uses CPU Pressure Stall Information (PSI) to automate VM live migrations

  • Know how cross-cluster live migration enables zero-downtime VM movement between OpenShift clusters

  • Grasp how the Gateway API Inference Extension (GIE) and llm-d solve GPU-efficient LLM serving

  • Understand how IBM Content-Aware Storage turns passive storage into an active RAG pipeline participant

  • See how Dynamic Resource Allocation (DRA) replaces device plugins for intelligent GPU scheduling

Autonomic Virtualization: The Descheduler Evolution

What You Did Manually in Module 2

In Module 2, you manually initiated a live migration of your VM. You decided when and where to move the workload. In production, this manual approach doesn’t scale.

The KubeVirtRelieveAndMigrate Profile

OpenShift 4.20 introduced the KubeVirtRelieveAndMigrate descheduler profile, which uses real CPU utilization and Pressure Stall Information (PSI) metrics to automatically detect overloaded nodes and trigger live migrations. This profile is carried forward and refined in 4.21.

How Two-Phase Rebalancing Works:

  1. Phase 1 — CPU Utilization Balancing: The descheduler monitors actual CPU utilization across all nodes. When a node exceeds the cluster average by 10% or more, VMs are evicted from that overutilized node and migrated to underutilized nodes.

  2. Phase 2 — PSI-Driven Relief: When cluster-wide CPU utilization reaches 80%, raw utilization numbers become less meaningful. The descheduler switches to PSI CPU metrics, which quantify how much time workloads spend stalled waiting for CPU. VMs are moved from high-pressure nodes to lower-pressure nodes.

  3. Migration execution: When a VM pod is evicted, OpenShift Virtualization automatically triggers a live migration, the same operation you performed manually in Module 2.

Today (Manual):
  Admin notices node overloaded → Decides which VM to move → Triggers migration

4.21 (Autonomic):
  PSI metrics detect contention → Descheduler selects VM → Auto-triggers live migration

Prerequisites and Configuration:

PSI metrics must be enabled on all worker nodes via a MachineConfig with kernel argument psi=1. The profile also supports soft-tainting (devEnableSoftTainter), which dynamically applies scheduling hints at 60-second intervals so that new VMs are placed on less-loaded nodes from the start.
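The two pieces of configuration above can be sketched as follows. This is a minimal illustration, not a verified production manifest: the MachineConfig shape is the standard OpenShift API, but the descheduler profile name and `profileCustomizations` fields should be checked against the 4.21 release documentation for your installed Kube Descheduler Operator version.

```yaml
# Enable PSI accounting on all worker nodes (triggers a rolling reboot)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-enable-psi
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - psi=1
---
# Activate the load-aware profile in the Kube Descheduler Operator
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  mode: Automatic                  # actually evict pods, rather than dry-run
  profiles:
    - KubeVirtRelieveAndMigrate
  profileCustomizations:
    devEnableSoftTainter: true     # apply scheduling hints to loaded nodes
```

Once the nodes reboot with `psi=1`, the kernel exposes stall data under /proc/pressure/cpu on each node, which is what the Phase 2 logic reads; the soft tainter then re-evaluates node load on its 60-second cycle.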

Measured Results:

In Red Hat testing with 130 VMs across a 12-worker cluster, CPU utilization standard deviation dropped from ~50% to 7%, and average vCPU wait time fell from over 100% to nearly zero.

Cross-Cluster Live Migration (Tech Preview)

OpenShift Virtualization 4.21 extends migration beyond a single cluster. Cross-cluster live migration enables zero-downtime VM movement between two OpenShift clusters connected on the same L2 network segment. This is enabled through feature gates in OpenShift Virtualization and the Migration Toolkit for Virtualization (MTV).

Use cases include:

  • Cluster maintenance — drain an entire cluster without VM downtime

  • Geographic load balancing — shift VMs to clusters closer to users

  • Disaster recovery preparation — pre-stage VMs on a standby cluster

Cross-cluster live migration is a Technology Preview feature in 4.21 and is not yet supported for production workloads.

AI Workload Optimization

The GPU Scheduling Problem

Standard HTTP load balancing distributes requests round-robin across replicas. This works well for stateless web applications, but fails catastrophically for Large Language Models (LLMs) because:

  • LLM inference is stateful — each request depends on a KV-cache loaded in GPU memory

  • GPU memory is scarce and expensive — a single A100 costs $10,000+

  • Token generation is sequential — you can’t parallelize a single request across GPUs

  • Batch size matters — GPU utilization plummets when batches are too small or too large

Naive round-robin routing sends requests to GPUs that may not have the right model loaded, triggering expensive model swaps that waste GPU cycles.

Gateway API Inference Extension (GIE)

The Gateway API Inference Extension solves this by transforming any Envoy-based gateway into an inference-aware gateway. GIE went GA in OpenShift 4.20 (November 2025) and is fully supported in 4.21.

GIE introduces two Custom Resource Definitions:

  • InferencePool — defines a group of model-server pods on shared compute, with scaling and balancing policies

  • InferenceModel — maps a user-facing model name to backend models within an InferencePool, supporting traffic splitting between model versions or LoRA fine-tuned adapters
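A sketch of how the two CRDs fit together. The field names follow the upstream v1alpha2 Inference Extension spec and may differ slightly in the shipped GA release; the pool, endpoint-picker, and model names here are hypothetical.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  selector:
    app: vllm-llama                   # matches the model-server pods
  targetPortNumber: 8000
  extensionRef:
    name: llama-endpoint-picker      # the extension that scores backends
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-model
spec:
  modelName: chat                    # user-facing name in request bodies
  criticality: Critical
  poolRef:
    name: vllm-llama-pool
  targetModels:                      # 90/10 split: base model vs. LoRA adapter
    - name: llama-3-8b
      weight: 90
    - name: llama-3-8b-support-lora
      weight: 10
```

An ordinary Gateway API HTTPRoute then references the InferencePool as its backend, which is what makes the Envoy-based gateway consult the inference-aware picker instead of falling back to round-robin.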

Inference-Aware Routing Strategies:

  • Adapter-aware routing — directs requests to backends that already have the required LoRA adapter loaded, avoiding expensive GPU memory swaps

  • Queue-aware scheduling — avoids overwhelmed backends by considering request queue depth

  • Prefix-cache-aware routing — computes a prefix hash for each prompt and routes to the pod already holding that prefix cached, maximizing KV-cache hit rates

  • Load shedding — safeguards system stability when all backends are heavily loaded

llm-d: Kubernetes-Native Distributed Inference

llm-d is a distributed LLM inference framework donated to the CNCF as a Sandbox project in March 2026 by IBM Research, Red Hat, and Google Cloud. Founding partners include NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.

llm-d sits between the inference engine (vLLM) and the orchestration layer (KServe), implementing the Gateway API Inference Extension. It addresses three production bottlenecks:

  • Resource imbalance — disaggregated serving separates Prefill (prompt processing) and Decode (token generation) into independently scalable pod pools

  • KV-cache waste — hierarchical cache offloading tiers data across GPU HBM → CPU DRAM → NVMe, plus prefix caching for multi-tenant environments

  • Routing inefficiency — the Endpoint Picker (EPP) replaces round-robin with prefix-cache-aware routing that maximizes cache hit rates

The project’s mission is "any model, any accelerator, any cloud" — providing accelerator-neutral, vendor-agnostic inference infrastructure.

IBM Content-Aware Storage (CAS)

Traditional storage is passive: it stores bytes and serves them on request. IBM Content-Aware Storage makes storage an active participant in AI pipelines by embedding NLP pipelines and a vector database directly into IBM Storage Scale.

How CAS Works:

  1. CAS applies NLP techniques to extract semantic meaning from unstructured content — text, charts, graphs, and images — converting them into dense vector embeddings

  2. It automatically monitors file changes across heterogeneous storage systems and detects modifications

  3. Only changed content is re-processed and updated in the vector database — no full retraining cycles required

  4. Applications query the embedded vector database for contextual, semantic retrieval that goes beyond keyword matching

Enterprise Integration:

IBM Fusion for Red Hat AI combines CAS with OpenShift AI, KServe/vLLM, and GPU-accelerated inference into a sovereign on-premises AI platform. Validated deployments on Fusion HCI with OpenShift AI enable direct GPU scheduling, low-latency storage access, and production-grade LLMOps.

On the BEIR benchmark suite, IBM Fusion CAS outperforms state-of-the-art information retrieval systems in question-answering accuracy through advanced embedding models, vector indexing, and re-ranking strategies.

Dynamic Resource Allocation (DRA) for GPUs

Dynamic Resource Allocation went GA in OpenShift 4.21, replacing the legacy device-plugin model for GPU scheduling. Built on Kubernetes 1.34’s upstream DRA implementation, it enables workloads to request GPUs based on device attributes rather than simple counts.

What DRA enables:

  • Expression-driven GPU selection — filter by model, memory capacity, compute capability, and driver version using CEL expressions

  • GPU sharing — multiple containers can share a single GPU

  • Topology awareness — respect PCIe, NVLink, and NUMA placement for optimal data transfer

  • MIG profiles — request specific Multi-Instance GPU partitions

  • Cluster autoscaler integration — provision new GPU nodes based on pending DRA claims

DRA works through four Kubernetes resources:

  • ResourceSlice — DRA drivers publish available devices with typed attributes

  • DeviceClass — admins define device categories using CEL selector expressions

  • ResourceClaim — workloads request specific devices

  • ResourceClaimTemplate — auto-generates per-pod claims
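The admin-side and workload-side halves of this model can be sketched as below. The resource kinds and the `resource.k8s.io/v1` group are the upstream Kubernetes 1.34 GA API, but the attribute name under `device.attributes` is driver-specific and shown here as a hypothetical example, as is the container image.

```yaml
# Admin: a DeviceClass selecting GPUs by a CEL expression over typed attributes
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: large-gpu
spec:
  selectors:
    - cel:
        expression: device.attributes["gpu.example.com"].model == "A100"
---
# Workload: a template that generates one claim per pod
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: one-large-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: large-gpu
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: one-large-gpu
  containers:
    - name: train
      image: quay.io/example/trainer:latest   # hypothetical image
      resources:
        claims:
          - name: gpu                          # bind the claim to this container
```

Each pod stamped from the template gets its own ResourceClaim; the scheduler allocates a concrete device from a ResourceSlice that satisfies the selector, and pending claims with no matching slice are what drive the cluster autoscaler to provision new GPU nodes.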

Queue Management with Kueue:

The Red Hat build of Kueue 1.2 complements DRA with hierarchical quota management and priority scheduling for AI training jobs. In production, IBM achieved 90% GPU utilization on its Vela platform using OpenShift with Kueue — up from the 30-50% typical of ML platforms.
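A minimal quota sketch using the upstream Kueue API shapes (the queue names, namespace, and 8-GPU quota are hypothetical; verify the API version against the Red Hat build of Kueue 1.2):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# Cluster-scoped quota shared by the team
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-ai
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 8        # hypothetical team-wide GPU budget
---
# Namespace-local entry point that jobs submit to
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training
  namespace: ml-team
spec:
  clusterQueue: team-ai
```

Training jobs labeled with the LocalQueue are held until quota is available, which is how Kueue keeps admitted jobs (and their GPUs) busy instead of letting pods sit half-scheduled.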

Together, GIE + llm-d, CAS, and DRA + Kueue optimize the three most expensive resources in AI infrastructure: GPU compute, high-performance storage I/O, and GPU scheduling efficiency.

The Convergence Story

What you experienced today maps directly to what is now shipping in OpenShift 4.21:

| Today (This Workshop) | OpenShift 4.21 | Benefit |
|---|---|---|
| Manual live migration | PSI-based KubeVirtRelieveAndMigrate + cross-cluster live migration | Autonomic VM operations within and across clusters |
| Manual backup triggers | CBT-powered incremental backups (GA in 4.21) | Storage-agnostic zero-touch data protection |
| Static storage classes | Content-Aware Storage with real-time vector sync | AI-optimized RAG pipelines |
| Standard HTTP routing | GIE (GA) + llm-d disaggregated serving | GPU-efficient LLM serving with cache-aware routing |
| Manual GPU allocation | DRA (GA) + Kueue 1.2 priority scheduling | Attribute-based GPU sharing with topology awareness |

Facilitator Notes: This module covers four key themes. Adjust depth based on audience:

  1. Autonomic VM Operations — The KubeVirtRelieveAndMigrate descheduler profile automates exactly what attendees did manually in Module 2. Emphasize the two-phase approach (CPU utilization then PSI) and the measured results (std-dev from 50% to 7%). Mention cross-cluster live migration as the next frontier (Tech Preview).

  2. GPU-Efficient Inference — Standard load balancing fails for LLMs. GIE (GA since 4.20) adds inference-aware routing. llm-d (CNCF Sandbox, March 2026) builds on GIE with disaggregated serving and prefix-cache-aware routing. This is a joint IBM Research, Red Hat, and Google Cloud effort.

  3. Content-Aware Storage — CAS makes storage an active RAG participant rather than a passive byte store. Key message: storage that understands your data eliminates the semantic gap between where enterprise data lives and what AI needs.

  4. DRA for GPUs — GA in 4.21, replacing device plugins. Enables expression-driven GPU selection, sharing, and topology awareness. Pair with Kueue for queue management. IBM achieved 90% GPU utilization on its Vela platform.