Recently, I came across a research paper published in July 2025. As a software architect, even though most of the complexity of orchestrating applications is abstracted away, we still face issues with Kubernetes because of its workloads. I have been exploring how Kubernetes can be improved using the latest developments in the AI world. This paper interested me because it uses a Reinforcement Learning reward mechanism to update a policy that automatically takes decisions in Kubernetes, such as autoscaling and pod restarts, using the built-in Kubernetes API.
Let us dive deep into what this paper explores.
The main challenge this paper tackles is that the Horizontal Pod Autoscaler (HPA), which scales pods based on resource usage and traffic, sometimes fails under bursty traffic because it takes time to scale, and it lacks integration with GPU-level metrics. To overcome these challenges, the team built the Kubernetes Inference Simulator (KISim) combined with KIScaler, a Proximal Policy Optimization (PPO) based autoscaler. The system observes metrics via Prometheus and adjusts the replica count using the Kubernetes API. The evaluation covered four traffic patterns: ramp, periodic, random, and spike. The results are promising: P95 latency was reduced by up to 6.7x over CPU-only baselines, with improved GPU utilization.
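For contrast with KIScaler, the stock HPA decision is a simple proportional rule from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch (the helper name and replica bounds are mine) shows why it reacts only after a burst has already driven the metric up:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_r: int = 1, max_r: int = 10) -> int:
    """HPA control loop: desired = ceil(current * current_metric / target_metric),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# A burst that doubles CPU usage against a 50% target doubles the replicas,
# but only after the usage spike has already been observed:
print(hpa_desired_replicas(3, 100.0, 50.0))  # -> 6
```

The rule is purely reactive: until the metric rises, no extra capacity exists, which is exactly the lag that hurts under bursty traffic.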
By continuously interacting with the environment, an RL-based autoscaler can learn multi-objective scheduling strategies tailored to workload dynamics. KIS-S consists of KISim, a simulator that emulates AI workloads on real hardware with Prometheus integration, and KIScaler, an RL autoscaler trained entirely in simulation and deployed directly in production without retraining.
- Workload generator that produces synthetic traffic patterns: ramp, periodic, random, and spike.
- KISim models GPU-aware Kubernetes scheduling.
- Prometheus and DCGM Exporter handle system and GPU metric collection.
- KIScaler adjusts replica counts via the Kubernetes API based on the observed system state.
The reward function is carefully designed to balance three competing objectives:
- Minimizing P95 latency to preserve user experience
- Maximizing GPU utilization to ensure resource efficiency
- Minimizing scaling frequency to reduce system overhead.
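These three objectives can be folded into a single scalar reward. Here is a minimal sketch; the function name and the exact weights are my own assumptions for illustration (the simplified implementation later in this post happens to default to alpha=1.0, beta=0.6, gamma=0.25), not the paper's published values:

```python
def kiscaler_reward(p95_ms: float, gpu_util: float, scaled: bool,
                    alpha: float = 1.0, beta: float = 0.6, gamma: float = 0.25) -> float:
    """Toy reward: latency penalty + GPU-utilization bonus - scaling penalty.
    Weights are illustrative assumptions, not the paper's values."""
    scale_penalty = 1.0 if scaled else 0.0
    return -alpha * (p95_ms / 1000.0) + beta * gpu_util - gamma * scale_penalty

# Low latency with a busy GPU and no scaling action scores well:
print(round(kiscaler_reward(120.0, 0.8, scaled=False), 3))  # -> 0.36
```

The key design choice is that every term is normalized to a similar scale, so no single objective dominates the gradient during PPO training.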
System Architecture

Here is the system architecture, where you can find five main components:
- Workload: the generator that produces the different traffic patterns.
- KISim: the inference environment where the GPU workload runs.
- Monitoring System: Prometheus collects metrics from the Control Plane's controller manager and provides them to the RL environment.
- Control Plane: the Kubernetes master that controls all the worker nodes and hosts the Kubernetes API server and controller manager.
- KIScaler: the PPO agent reads the metrics stored in the RL environment and, based on its trained policy, takes an action that calls the Kubernetes API to adjust the replica count.
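To make the last step concrete, here is a minimal sketch of how an agent's scaling action could become a Kubernetes API call. The helper and its replica bounds are my own illustration; against a live cluster, the returned body is the kind of payload you would pass to `AppsV1Api().patch_namespaced_deployment_scale(name, namespace, body=...)` from the official `kubernetes` Python client:

```python
def build_scale_patch(current: int, delta: int, min_r: int = 0, max_r: int = 6) -> dict:
    """Clamp the requested replica change and build a Scale patch body.
    In a live cluster this body would be sent via the official `kubernetes`
    client's AppsV1Api.patch_namespaced_deployment_scale."""
    replicas = max(min_r, min(max_r, current + delta))
    return {"spec": {"replicas": replicas}}

print(build_scale_patch(3, +2))   # -> {'spec': {'replicas': 5}}
print(build_scale_patch(1, -3))   # clamped at the floor of 0
```

Clamping inside the agent-facing helper keeps the policy free to propose aggressive deltas while the cluster only ever sees valid replica counts.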
All experiments are conducted on a single-node Kubernetes cluster provisioned via MicroK8s, with GPU scheduling enabled through the NVIDIA container runtime and GPU Operator. Each deployment includes 3 GPU serving replicas, 3 CPU serving replicas, and 3 Redis instances to simulate heterogeneous workloads.
The system performance was evaluated using three categories of metrics:
- Inference metrics, including 95th percentile (P95) latency and throughput;
- Resource efficiency, measured by CPU, memory, and GPU utilization;
- Scheduling behavior, captured by scheduling latency (time from pod creation to placement), pod distribution across logical partitions, and resource allocation gap (difference between requested and actual usage).
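As a quick illustration of the first category: P95 is simply the 95th percentile of per-request latency. In the paper it comes from Prometheus (typically via a `histogram_quantile` query over latency histogram buckets), but the idea is easy to see on synthetic samples; the latency distribution below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-request latencies in ms; inference latency is typically heavy-tailed
latencies_ms = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
p50 = np.percentile(latencies_ms, 50)
p95 = np.percentile(latencies_ms, 95)
print(f"P50={p50:.1f} ms, P95={p95:.1f} ms")  # the tail dominates user experience
```

This is why autoscalers that optimize mean latency can still deliver a poor user experience: the tail moves much faster than the median under load.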
The results are shown below.

Key findings
- GPU > CPU for irregular/random traffic; spike/ramp show 1.27× / 1.16× speedup for GPU over CPU under baselines.
- KIScaler (after training in simulation only) reduces P95 latency up to 6.7× vs CPU (ramp), and 2.6× / 5.1× vs GPU/CPU (random); periodic ≈ 2.3× over both; spike has limited headroom.
- Resource insights: GPU-serving pods used 773–876m CPU versus 715–1062m for CPU-serving pods, with memory lower on the CPU pods, and all pods were scheduled in under 1 s; the bottleneck is the scaling policy, not placement.
My implementation of the paper
I tried to implement my own version of it, simulating the traffic patterns. Due to GPU workload limitations, I ran a simplified version on Google Colab.
You can find the copy here
- First, we simulate the traffic patterns.
import numpy as np

def traffic_ramp(t, T, low=80, high=800):
    return np.linspace(low, high, T)[t]

def traffic_periodic(t, T, base=400, amp=250, cycles=3):
    return max(1.0, base + amp*np.sin(2*np.pi*cycles*t/T))

def traffic_random(t, T, base=350, noise=200):
    return max(1.0, base + np.random.randn()*noise)

def traffic_spike(t, T, base=250, spike_amp=1200, spike_prob=0.06):
    val = base + np.random.randn()*70
    if np.random.rand() < spike_prob:
        val += spike_amp
    return max(1.0, val)

PATTERNS = ["ramp", "periodic", "random", "spike"]

def draw_load(pattern, t, T):
    if pattern == "ramp":     return traffic_ramp(t, T)
    if pattern == "periodic": return traffic_periodic(t, T)
    if pattern == "random":   return traffic_random(t, T)
    if pattern == "spike":    return traffic_spike(t, T)
    raise ValueError(pattern)
Let us create a minimal KISim environment.
import gymnasium as gym
from gymnasium import spaces

class KISimEnv(gym.Env):
    metadata = {"render_modes": []}

    def __init__(self,
                 episode_len=240,
                 max_cpu=6,
                 max_gpu=3,
                 gpu_cap_per_replica=520,   # "rps" capacity per GPU replica
                 cpu_cap_per_replica=130,   # "rps" capacity per CPU replica
                 base_latency_cpu_ms=18.0,
                 base_latency_gpu_ms=6.0,
                 alpha=1.0, beta=0.6, gamma=0.25,
                 seed=None):
        super().__init__()
        self.rng = np.random.default_rng(seed)
        self.episode_len = episode_len
        self.max_cpu = max_cpu
        self.max_gpu = max_gpu
        self.gpu_cap = gpu_cap_per_replica
        self.cpu_cap = cpu_cap_per_replica
        self.base_cpu = base_latency_cpu_ms
        self.base_gpu = base_latency_gpu_ms
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # Multi-discrete: ΔGPU∈{-2..+2}, ΔCPU∈{-2..+2}, pref∈{0,1}
        self.action_space = spaces.MultiDiscrete(np.array([5, 5, 2], dtype=np.int64))
        # Observation: 10-dim, each ~[0,1] after internal scaling
        high = np.ones(10, dtype=np.float32)
        self.observation_space = spaces.Box(low=0.0, high=high, shape=(10,), dtype=np.float32)
        self.reset(seed=seed)

    def _choose_pattern(self):
        # Mix patterns across episodes, like the paper
        return self.rng.choice(PATTERNS)

    def _util_to_mem(self, cpu_util, gpu_util):
        # crude: memory rises with replicas; util adds noise
        mem = 0.10 + 0.06*self.cpu_repl + 0.10*self.gpu_repl + 0.05*(cpu_util + gpu_util)
        return float(np.clip(mem, 0.0, 1.0))

    def _p95_ms(self, load_cpu, load_gpu, cap_cpu, cap_gpu):
        # Queueing-like growth as rho→1, with different base latencies
        eps = 1e-6
        rho_cpu = load_cpu/(cap_cpu + eps) if cap_cpu > 0 else 0.0
        rho_gpu = load_gpu/(cap_gpu + eps) if cap_gpu > 0 else 0.0
        mult_cpu = 1.0 / (1.0 - min(0.995, rho_cpu + 0.02))**2
        mult_gpu = 1.0 / (1.0 - min(0.995, rho_gpu + 0.02))**2
        # weighted by share of served requests
        served = max(eps, load_cpu + load_gpu)
        w_cpu = load_cpu/served
        w_gpu = load_gpu/served
        return w_cpu*self.base_cpu*mult_cpu + w_gpu*self.base_gpu*mult_gpu

    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self.rng = np.random.default_rng(seed)
        self.t = 0
        self.pattern = self._choose_pattern()
        self.cpu_repl = 2
        self.gpu_repl = 0  # start CPU-only
        self.prev_latency = 0.050
        self.prev_throughput = 0.0
        obs = self._observe(throughput=0.0, p95=0.050, cpu_util=0.0, gpu_util=0.0)
        return obs, {}

    def _observe(self, throughput, p95, cpu_util, gpu_util):
        # normalize to [0,1] ranges
        act_repl = (self.cpu_repl + self.gpu_repl) / (self.max_cpu + self.max_gpu)
        p95_norm = float(np.tanh(p95/300.0))        # 300ms scale
        thr_norm = float(np.tanh(throughput/800.0))
        cpu_u = float(np.clip(cpu_util, 0, 1))
        gpu_u = float(np.clip(gpu_util, 0, 1))
        dlat = float(np.clip((p95 - self.prev_latency)/120.0 + 0.5, 0, 1))
        dthr = float(np.clip((throughput - self.prev_throughput)/400.0 + 0.5, 0, 1))
        time_norm = self.t / self.episode_len
        pat_id = PATTERNS.index(self.pattern) / (len(PATTERNS) - 1)  # 0..1
        mem_util = self._util_to_mem(cpu_u, gpu_u)
        obs = np.array([act_repl, gpu_u, p95_norm, thr_norm, cpu_u, mem_util,
                        dlat, dthr, time_norm, pat_id], dtype=np.float32)
        return obs

    def step(self, action):
        # decode action
        d_gpu = int(action[0]) - 2
        d_cpu = int(action[1]) - 2
        pref = int(action[2])  # 0=CPU-first, 1=GPU-first
        # apply scaling
        self.cpu_repl = int(np.clip(self.cpu_repl + d_cpu, 0, self.max_cpu))
        self.gpu_repl = int(np.clip(self.gpu_repl + d_gpu, 0, self.max_gpu))
        # incoming load (qps)
        L = draw_load(self.pattern, self.t, self.episode_len)
        cap_cpu = self.cpu_repl * self.cpu_cap
        cap_gpu = self.gpu_repl * self.gpu_cap
        # placement preference: fill preferred partition first
        load_cpu = load_gpu = 0.0
        if pref == 1:  # GPU-first
            take_gpu = min(L, cap_gpu)
            load_gpu = take_gpu
            load_cpu = min(L - take_gpu, cap_cpu)
        else:          # CPU-first
            take_cpu = min(L, cap_cpu)
            load_cpu = take_cpu
            load_gpu = min(L - take_cpu, cap_gpu)
        served = load_cpu + load_gpu
        throughput = served  # rps
        # utils (0..1)
        cpu_util = (load_cpu/cap_cpu) if cap_cpu > 0 else 0.0
        gpu_util = (load_gpu/cap_gpu) if cap_gpu > 0 else 0.0
        p95 = self._p95_ms(load_cpu, load_gpu, cap_cpu, cap_gpu)
        # reward: lower p95, prefer GPU util, penalize replicas
        over = 0.3*self.cpu_repl + 0.6*self.gpu_repl  # GPUs "cost" more
        reward = -self.alpha*(p95/1000.0) + self.beta*gpu_util - self.gamma*(over/10.0)
        obs = self._observe(throughput, p95, cpu_util, gpu_util)
        self.prev_latency = p95
        self.prev_throughput = throughput
        self.t += 1
        terminated = (self.t >= self.episode_len)
        truncated = False
        info = {"p95_ms": p95, "throughput": throughput, "cpu_util": cpu_util,
                "gpu_util": gpu_util, "load": L}
        return obs, float(reward), terminated, truncated, info
This is a custom Gymnasium environment that simulates a tiny “cluster” with CPU and GPU replicas. At each time step, an RL agent decides:
- how many CPU replicas to add/remove,
- how many GPU replicas to add/remove,
- whether to place incoming load CPU-first or GPU-first.
The env returns a 10-value state, a reward that balances latency vs. cost, and a done flag when the episode ends.
- Episode length (episode_len): number of time steps before the episode ends.
- Replica limits (max_cpu, max_gpu): caps on how many CPU/GPU pods you can run.
- Per-replica capacity (cpu_cap_per_replica, gpu_cap_per_replica): how many requests/sec one CPU/GPU replica can handle.
- Base latencies (base_latency_cpu_ms, base_latency_gpu_ms): low-load p95 latency for CPU/GPU serving.
Reward weights (alpha, beta, gamma):
- alpha: penalize high latency,
- beta: reward using the GPU efficiently,
- gamma: penalize running many replicas (esp. GPU).
It also:
- creates a random number generator (self.rng),
- defines the action space and observation space,
- calls self.reset(...) to initialize state.
- _util_to_mem is a toy memory-usage model: more replicas means more memory, and higher utilization adds a bit. The value is clipped to [0,1] because the observation expects normalized values.
- _p95_ms calculates P95 latency for the current step using a queueing-style formula: compute the utilization ρ = load / capacity for CPU and GPU, turn each ρ into a latency multiplier that blows up as ρ → 1, and combine the CPU/GPU latencies with weights proportional to the share of requests each served.
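The queueing-style blow-up in _p95_ms is easy to check with a couple of values. This standalone copy of the multiplier (extracted here purely for illustration) shows how sharply latency grows near saturation:

```python
def latency_multiplier(rho: float) -> float:
    """Queueing-style blow-up used in _p95_ms: the multiplier explodes as
    effective utilization approaches 1 (capped at 0.995)."""
    return 1.0 / (1.0 - min(0.995, rho + 0.02)) ** 2

print(round(latency_multiplier(0.5), 2))  # -> 4.34   (moderate load)
print(round(latency_multiplier(0.9), 2))  # -> 156.25 (near saturation)
```

Going from 50% to 90% utilization multiplies latency roughly 36x, which is why the agent has a strong incentive to scale out before utilization gets close to 1.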
Action space: MultiDiscrete([5,5,2]) → three choices each step:
1) ΔGPU ∈ {−2,−1,0,+1,+2}, 2) ΔCPU ∈ {−2,−1,0,+1,+2}, 3) preference ∈ {CPU-first, GPU-first} (how to place incoming load).
Reward: reward = −α·p95(ms)/1000 + β·gpu_util − γ·replica_cost
(penalize latency, mildly reward using the GPU efficiently, penalize running many replicas; GPUs “cost” more).
- _choose_pattern() picks the load pattern for the episode.
- _util_to_mem(...) makes a simple memory-usage estimate from replica counts and utilization.
- _p95_ms(...) computes p95 latency with a queueing-style blow-up as utilization → 1.
- _observe(...) builds the 10-D normalized state.
Main loop (step(action)):
- Decode action → adjust CPU/GPU replica counts within bounds.
- Get current load from the chosen traffic pattern.
- Compute CPU/GPU capacity from replicas.
- Place load per preference (fill preferred side, spill to the other)
- Compute throughput, CPU/GPU utilization, p95 latency.
- Compute reward, build next observation, advance time
- Return (obs, reward, terminated, truncated, info) where info has raw metrics (p95_ms, throughput, utilizations, load).
Let us run a sanity test and plot the results.
import matplotlib.pyplot as plt

env = KISimEnv(episode_len=240, seed=42)
obs, _ = env.reset()
lat, thru, load = [], [], []
for _ in range(env.episode_len):
    a = env.action_space.sample()
    obs, r, done, trunc, info = env.step(a)
    lat.append(info["p95_ms"]); thru.append(info["throughput"]); load.append(info["load"])
    if done:
        break

plt.figure(); plt.title("Sanity: load vs throughput")
plt.plot(load, label="load"); plt.plot(thru, label="throughput"); plt.legend()
plt.figure(); plt.title("Sanity: P95 latency (ms)"); plt.plot(lat)
plt.show()


Train the PPO autoscaler
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

train_env = make_vec_env(lambda: KISimEnv(episode_len=240), n_envs=4)  # vectorized for speed
model = PPO("MlpPolicy", train_env, verbose=0,
            n_steps=512, batch_size=1024, learning_rate=3e-4,
            gamma=0.995, gae_lambda=0.95, ent_coef=0.01, tensorboard_log=None)
model.learn(total_timesteps=120_000)  # ~a few minutes on Colab CPU
model.save("ppo_kissim.zip")
Now that the model is trained, it is time to evaluate the PPO policy against a baseline.
def eval_policy(policy=None, episodes=8):
    results = []
    for ep in range(episodes):
        env = KISimEnv(episode_len=240, seed=ep*7 + 1)
        obs, _ = env.reset()
        p95s = []; thr = []; loads = []
        while True:
            if policy is None:
                # baseline steps with a no-op action; the HPA-ish heuristic
                # below adjusts replicas directly using the step's info
                act = np.array([2, 2, 0])  # ΔGPU=0, ΔCPU=0, pref=CPU-first
            else:
                act, _ = policy.predict(obs, deterministic=True)
            obs, r, done, trunc, info = env.step(act)
            # crude HPA-ish rule for the baseline, applied after we have info:
            if policy is None:
                # if p95 is high → scale CPU; if load is near CPU capacity → add a "GPU"
                if info["p95_ms"] > 150 and env.cpu_repl < env.max_cpu:
                    env.cpu_repl = min(env.max_cpu, env.cpu_repl + 1)
                elif info["load"] > (env.cpu_repl*env.cpu_cap*0.9) and env.gpu_repl < 1:
                    env.gpu_repl = min(env.max_gpu, env.gpu_repl + 1)
            p95s.append(info["p95_ms"]); thr.append(info["throughput"]); loads.append(info["load"])
            if done:
                break
        results.append((np.array(p95s), np.array(thr), np.array(loads)))
    return results

ppo_results = eval_policy(policy=model, episodes=8)
base_results = eval_policy(policy=None, episodes=8)

def summarize(results, name):
    p95_all = np.concatenate([r[0] for r in results])
    thr_all = np.concatenate([r[1] for r in results])
    ld_all = np.concatenate([r[2] for r in results])
    print(f"{name}: P95(ms) mean={p95_all.mean():.1f}, median={np.median(p95_all):.1f}, "
          f"95th={np.percentile(p95_all, 95):.1f}")
    print(f"{name}: Throughput mean={thr_all.mean():.1f} rps | Mean load={ld_all.mean():.1f} rps")

summarize(ppo_results, "PPO")
summarize(base_results, "Baseline")

As you can see, the PPO policy shows a significant improvement in both P95 latency and throughput over the baseline.
This paper is interesting and gives a good starting point for how AI can be used in our architectures to better utilize resources.
Happy Learning!!
