Recently, I came across a research paper published in July 2025. As a software architect, even though most of the complexity of orchestrating applications is abstracted away, we still face issues with Kubernetes because of its workloads. I have been exploring how Kubernetes can be improved using the latest developments in the AI world. This paper interested me because it uses a Reinforcement Learning reward mechanism to update a policy that automatically takes decisions in Kubernetes, such as autoscaling and pod restarts, using the built-in Kubernetes API.
Let us dive deep into what this paper explores.
The main challenge this paper tackles is that the Horizontal Pod Autoscaler (HPA), which scales pods based on resource usage and traffic, sometimes fails under bursty traffic because it takes time to scale, and it lacks integration with GPU-level metrics. To overcome these challenges, the team built the Kubernetes Inference Simulator (KISim) combined with KIScaler, a Proximal Policy Optimization (PPO) based autoscaler. The system observes metrics via Prometheus and adjusts the replica count using the Kubernetes API. The evaluation covered four traffic patterns: ramp, periodic, random, and spike. The results are promising: P95 latency was reduced by up to 6.7x over CPU-only baselines, with improved GPU utilization.
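For contrast with KIScaler, the stock HPA decision is a simple proportional rule from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch (the helper name and replica bounds are mine) shows why it reacts only after a burst has already driven the metric up:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_r: int = 1, max_r: int = 10) -> int:
    """HPA control loop: desired = ceil(current * current_metric / target_metric),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# A burst that doubles CPU usage against a 50% target doubles the replicas,
# but only after the usage spike has already been observed:
print(hpa_desired_replicas(3, 100.0, 50.0))  # -> 6
```

The rule is purely reactive: until the metric rises, no extra capacity exists, which is exactly the lag that hurts under bursty traffic.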
By continuously interacting with the environment, an RL-based autoscaler can learn multi-objective scheduling strategies tailored to workload dynamics. KIS-S consists of KISim, a simulator that emulates AI workloads on real hardware with Prometheus integration, and KIScaler, an RL autoscaler trained entirely in simulation and deployed directly in production without retraining.
- Workload generator that produces synthetic traffic patterns: ramp, periodic, random, and spike.
- KISim models GPU-aware Kubernetes scheduling.
- Prometheus and DCGM Exporter handle system and GPU metric collection.
- KIScaler adjusts replica counts via the Kubernetes API based on the observed system state.
The reward function is carefully designed to balance three competing objectives:
- Minimizing P95 latency to preserve user experience
- Maximizing GPU utilization to ensure resource efficiency
- Minimizing scaling frequency to reduce system overhead.
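These three objectives can be folded into a single scalar reward. Here is a minimal sketch; the function name and the exact weights are my own assumptions for illustration (the simplified implementation later in this post happens to default to alpha=1.0, beta=0.6, gamma=0.25), not the paper's published values:

```python
def kiscaler_reward(p95_ms: float, gpu_util: float, scaled: bool,
                    alpha: float = 1.0, beta: float = 0.6, gamma: float = 0.25) -> float:
    """Toy reward: latency penalty + GPU-utilization bonus - scaling penalty.
    Weights are illustrative assumptions, not the paper's values."""
    scale_penalty = 1.0 if scaled else 0.0
    return -alpha * (p95_ms / 1000.0) + beta * gpu_util - gamma * scale_penalty

# Low latency with a busy GPU and no scaling action scores well:
print(round(kiscaler_reward(120.0, 0.8, scaled=False), 3))  # -> 0.36
```

The key design choice is that every term is normalized to a similar scale, so no single objective dominates the gradient during PPO training.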
System Architecture

Here is the system architecture, where you can find five main components:
- Workload: the generator that produces the different traffic patterns.
- KISim: the inference environment where the GPU workload runs.
- Monitoring System: Prometheus collects metrics from the Control Plane's controller manager and provides them to the RL environment.
- Control Plane: the Kubernetes master that controls all the worker nodes and hosts the Kubernetes API server and controller manager.
- KIScaler: the PPO agent reads the metrics stored in the RL environment and, based on its trained policy, takes an action that calls the Kubernetes API to adjust the replica count.
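To make the last step concrete, here is a minimal sketch of how an agent's scaling action could become a Kubernetes API call. The helper and its replica bounds are my own illustration; against a live cluster, the returned body is the kind of payload you would pass to `AppsV1Api().patch_namespaced_deployment_scale(name, namespace, body=...)` from the official `kubernetes` Python client:

```python
def build_scale_patch(current: int, delta: int, min_r: int = 0, max_r: int = 6) -> dict:
    """Clamp the requested replica change and build a Scale patch body.
    In a live cluster this body would be sent via the official `kubernetes`
    client's AppsV1Api.patch_namespaced_deployment_scale."""
    replicas = max(min_r, min(max_r, current + delta))
    return {"spec": {"replicas": replicas}}

print(build_scale_patch(3, +2))   # -> {'spec': {'replicas': 5}}
print(build_scale_patch(1, -3))   # clamped at the floor of 0
```

Clamping inside the agent-facing helper keeps the policy free to propose aggressive deltas while the cluster only ever sees valid replica counts.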
All experiments are conducted on a single-node Kubernetes cluster provisioned via MicroK8s, with GPU scheduling enabled through the NVIDIA container runtime and GPU Operator. Each deployment includes 3 GPU serving replicas, 3 CPU serving replicas, and 3 Redis instances to simulate heterogeneous workloads.
The system performance was evaluated using three categories of metrics:
- Inference metrics, including 95th percentile (P95) latency and throughput;
- Resource efficiency, measured by CPU, memory, and GPU utilization;
- Scheduling behavior, captured by scheduling latency (time from pod creation to placement), pod distribution across logical partitions, and resource allocation gap (difference between requested and actual usage).
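As a quick illustration of the first category: P95 is simply the 95th percentile of per-request latency. In the paper it comes from Prometheus (typically via a `histogram_quantile` query over latency histogram buckets), but the idea is easy to see on synthetic samples; the latency distribution below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-request latencies in ms; inference latency is typically heavy-tailed
latencies_ms = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
p50 = np.percentile(latencies_ms, 50)
p95 = np.percentile(latencies_ms, 95)
print(f"P50={p50:.1f} ms, P95={p95:.1f} ms")  # the tail dominates user experience
```

This is why autoscalers that optimize mean latency can still deliver a poor user experience: the tail moves much faster than the median under load.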
The results are shown below.

Key findings
- GPU > CPU for irregular/random traffic; spike/ramp show 1.27× / 1.16× speedup for GPU over CPU under baselines.
- KIScaler (after training in simulation only) reduces P95 latency up to 6.7× vs CPU (ramp), and 2.6× / 5.1× vs GPU/CPU (random); periodic ≈ 2.3× over both; spike has limited headroom.
- Resource insights: GPU-serving pods used 773–876m CPU versus 715–1062m for CPU-serving pods, with memory lower on the CPU pods, and all pods were scheduled in under 1 s; the bottleneck is the scaling policy, not placement.
My implementation of the paper
I tried to implement my own version of it, simulating the traffic patterns. Due to GPU workload limitations, I ran a simplified version on Google Colab.
You can find the copy here
- First, we simulate the traffic patterns.
import numpy as np

def traffic_ramp(t, T, low=80, high=800):
    return np.linspace(low, high, T)[t]

def traffic_periodic(t, T, base=400, amp=250, cycles=3):
    return max(1.0, base + amp*np.sin(2*np.pi*cycles*t/T))

def traffic_random(t, T, base=350, noise=200):
    return max(1.0, base + np.random.randn()*noise)

def traffic_spike(t, T, base=250, spike_amp=1200, spike_prob=0.06):
    val = base + np.random.randn()*70
    if np.random.rand() < spike_prob:
        val += spike_amp
    return max(1.0, val)

PATTERNS = ["ramp", "periodic", "random", "spike"]

def draw_load(pattern, t, T):
    if pattern == "ramp":     return traffic_ramp(t, T)
    if pattern == "periodic": return traffic_periodic(t, T)
    if pattern == "random":   return traffic_random(t, T)
    if pattern == "spike":    return traffic_spike(t, T)
    raise ValueError(pattern)
Let us create a minimal KISim environment.
import gymnasium as gym
from gymnasium import spaces

class KISimEnv(gym.Env):
    metadata = {"render_modes": []}

    def __init__(self,
                 episode_len=240,
                 max_cpu=6,
                 max_gpu=3,
                 gpu_cap_per_replica=520,   # "rps" capacity per GPU replica
                 cpu_cap_per_replica=130,   # "rps" capacity per CPU replica
                 base_latency_cpu_ms=18.0,
                 base_latency_gpu_ms=6.0,
                 alpha=1.0, beta=0.6, gamma=0.25,
                 seed=None):
        super().__init__()
        self.rng = np.random.default_rng(seed)
        self.episode_len = episode_len
        self.max_cpu = max_cpu
        self.max_gpu = max_gpu
        self.gpu_cap = gpu_cap_per_replica
        self.cpu_cap = cpu_cap_per_replica
        self.base_cpu = base_latency_cpu_ms
        self.base_gpu = base_latency_gpu_ms
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # Multi-discrete: ΔGPU∈{-2..+2}, ΔCPU∈{-2..+2}, pref∈{0,1}
        self.action_space = spaces.MultiDiscrete(np.array([5, 5, 2], dtype=np.int64))
        # Observation: 10-dim, each ~[0,1] after internal scaling
        high = np.ones(10, dtype=np.float32)
        self.observation_space = spaces.Box(low=0.0, high=high, shape=(10,), dtype=np.float32)
        self.reset(seed=seed)

    def _choose_pattern(self):
        # Mix patterns across episodes, like the paper
        return self.rng.choice(PATTERNS)

    def _util_to_mem(self, cpu_util, gpu_util):
        # crude: memory rises with replicas; util adds noise
        mem = 0.10 + 0.06*self.cpu_repl + 0.10*self.gpu_repl + 0.05*(cpu_util + gpu_util)
        return float(np.clip(mem, 0.0, 1.0))

    def _p95_ms(self, load_cpu, load_gpu, cap_cpu, cap_gpu):
        # Queueing-like growth as rho→1, with different base latencies
        eps = 1e-6
        rho_cpu = load_cpu/(cap_cpu + eps) if cap_cpu > 0 else 0.0
        rho_gpu = load_gpu/(cap_gpu + eps) if cap_gpu > 0 else 0.0
        mult_cpu = 1.0 / (1.0 - min(0.995, rho_cpu + 0.02))**2
        mult_gpu = 1.0 / (1.0 - min(0.995, rho_gpu + 0.02))**2
        # weighted by share of served requests
        served = max(eps, load_cpu + load_gpu)
        w_cpu = load_cpu/served
        w_gpu = load_gpu/served
        return w_cpu*self.base_cpu*mult_cpu + w_gpu*self.base_gpu*mult_gpu

    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self.rng = np.random.default_rng(seed)
        self.t = 0
        self.pattern = self._choose_pattern()
        self.cpu_repl = 2
        self.gpu_repl = 0  # start CPU-only
        self.prev_latency = 0.050
        self.prev_throughput = 0.0
        obs = self._observe(throughput=0.0, p95=0.050, cpu_util=0.0, gpu_util=0.0)
        return obs, {}

    def _observe(self, throughput, p95, cpu_util, gpu_util):
        # normalize to [0,1] ranges
        act_repl = (self.cpu_repl + self.gpu_repl) / (self.max_cpu + self.max_gpu)
        p95_norm = float(np.tanh(p95/300.0))        # 300ms scale
        thr_norm = float(np.tanh(throughput/800.0))
        cpu_u = float(np.clip(cpu_util, 0, 1))
        gpu_u = float(np.clip(gpu_util, 0, 1))
        dlat = float(np.clip((p95 - self.prev_latency)/120.0 + 0.5, 0, 1))
        dthr = float(np.clip((throughput - self.prev_throughput)/400.0 + 0.5, 0, 1))
        time_norm = self.t / self.episode_len
        pat_id = PATTERNS.index(self.pattern) / (len(PATTERNS) - 1)  # 0..1
        mem_util = self._util_to_mem(cpu_u, gpu_u)
        obs = np.array([act_repl, gpu_u, p95_norm, thr_norm, cpu_u, mem_util,
                        dlat, dthr, time_norm, pat_id], dtype=np.float32)
        return obs

    def step(self, action):
        # decode action
        d_gpu = int(action[0]) - 2
        d_cpu = int(action[1]) - 2
        pref = int(action[2])  # 0=CPU-first, 1=GPU-first
        # apply scaling
        self.cpu_repl = int(np.clip(self.cpu_repl + d_cpu, 0, self.max_cpu))
        self.gpu_repl = int(np.clip(self.gpu_repl + d_gpu, 0, self.max_gpu))
        # incoming load (qps)
        L = draw_load(self.pattern, self.t, self.episode_len)
        cap_cpu = self.cpu_repl * self.cpu_cap
        cap_gpu = self.gpu_repl * self.gpu_cap
        # placement preference: fill preferred partition first
        load_cpu = load_gpu = 0.0
        if pref == 1:  # GPU-first
            take_gpu = min(L, cap_gpu)
            load_gpu = take_gpu
            load_cpu = min(L - take_gpu, cap_cpu)
        else:          # CPU-first
            take_cpu = min(L, cap_cpu)
            load_cpu = take_cpu
            load_gpu = min(L - take_cpu, cap_gpu)
        served = load_cpu + load_gpu
        throughput = served  # rps
        # utils (0..1)
        cpu_util = (load_cpu/cap_cpu) if cap_cpu > 0 else 0.0
        gpu_util = (load_gpu/cap_gpu) if cap_gpu > 0 else 0.0
        p95 = self._p95_ms(load_cpu, load_gpu, cap_cpu, cap_gpu)
        # reward: lower p95, prefer GPU util, penalize replicas
        over = 0.3*self.cpu_repl + 0.6*self.gpu_repl  # GPUs "cost" more
        reward = -self.alpha*(p95/1000.0) + self.beta*gpu_util - self.gamma*(over/10.0)
        obs = self._observe(throughput, p95, cpu_util, gpu_util)
        self.prev_latency = p95
        self.prev_throughput = throughput
        self.t += 1
        terminated = (self.t >= self.episode_len)
        truncated = False
        info = {"p95_ms": p95, "throughput": throughput, "cpu_util": cpu_util,
                "gpu_util": gpu_util, "load": L}
        return obs, float(reward), terminated, truncated, info
This is a custom Gymnasium environment that simulates a tiny “cluster” with CPU and GPU replicas. At each time step, an RL agent decides:
- how many CPU replicas to add/remove,
- how many GPU replicas to add/remove,
- whether to place incoming load CPU-first or GPU-first.
The env returns a 10-value state, a reward that balances latency vs. cost, and a done flag when the episode ends.
- Episode length (episode_len): number of time steps before the episode ends.
- Replica limits (max_cpu, max_gpu): caps on how many CPU/GPU pods you can run.
- Per-replica capacity (cpu_cap_per_replica, gpu_cap_per_replica): how many requests/sec one CPU/GPU replica can handle.
- Base latencies (base_latency_cpu_ms, base_latency_gpu_ms): low-load p95 latency for CPU/GPU serving.
Reward weights (alpha, beta, gamma):
- alpha: penalize high latency,
- beta: reward using the GPU efficiently,
- gamma: penalize running many replicas (esp. GPU).
It also:
- creates a random number generator (self.rng),
- defines the action space and observation space,
- calls self.reset(...) to initialize state.
- _util_to_mem is a toy memory-usage model: more replicas means more memory, and higher utilization adds a bit. The value is clipped to [0,1] because the observation expects normalized values.
- _p95_ms calculates P95 latency for the current step using a queueing-style formula: compute the utilization ρ = load / capacity for CPU and GPU, turn each ρ into a latency multiplier that blows up as ρ → 1, and combine the CPU/GPU latencies with weights proportional to the share of requests each served.
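The queueing-style blow-up in _p95_ms is easy to check with a couple of values. This standalone copy of the multiplier (extracted here purely for illustration) shows how sharply latency grows near saturation:

```python
def latency_multiplier(rho: float) -> float:
    """Queueing-style blow-up used in _p95_ms: the multiplier explodes as
    effective utilization approaches 1 (capped at 0.995)."""
    return 1.0 / (1.0 - min(0.995, rho + 0.02)) ** 2

print(round(latency_multiplier(0.5), 2))  # -> 4.34   (moderate load)
print(round(latency_multiplier(0.9), 2))  # -> 156.25 (near saturation)
```

Going from 50% to 90% utilization multiplies latency roughly 36x, which is why the agent has a strong incentive to scale out before utilization gets close to 1.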
Action space: MultiDiscrete([5,5,2]) → three choices each step:
1) ΔGPU ∈ {−2,−1,0,+1,+2}, 2) ΔCPU ∈ {−2,−1,0,+1,+2}, 3) preference ∈ {CPU-first, GPU-first} (how to place incoming load).
Reward: reward = −α·p95(ms)/1000 + β·gpu_util − γ·replica_cost
(penalize latency, mildly reward using the GPU efficiently, penalize running many replicas; GPUs “cost” more).
- _choose_pattern() picks the load pattern for the episode.
- _util_to_mem(...) makes a simple memory-usage estimate from replica counts and utilization.
- _p95_ms(...) computes p95 latency with a queueing-style blow-up as utilization → 1.
- _observe(...) builds the 10-D normalized state.
Main loop (step(action)):
- Decode action → adjust CPU/GPU replica counts within bounds.
- Get current load from the chosen traffic pattern.
- Compute CPU/GPU capacity from replicas.
- Place load per preference (fill preferred side, spill to the other)
- Compute throughput, CPU/GPU utilization, p95 latency.
- Compute reward, build next observation, advance time
- Return (obs, reward, terminated, truncated, info) where info has raw metrics (p95_ms, throughput, utilizations, load).
Let us run a sanity test and plot the results.
import matplotlib.pyplot as plt

env = KISimEnv(episode_len=240, seed=42)
obs, _ = env.reset()
lat, thru, load = [], [], []
for _ in range(env.episode_len):
    a = env.action_space.sample()
    obs, r, done, trunc, info = env.step(a)
    lat.append(info["p95_ms"]); thru.append(info["throughput"]); load.append(info["load"])
    if done:
        break

plt.figure(); plt.title("Sanity: load vs throughput")
plt.plot(load, label="load"); plt.plot(thru, label="throughput"); plt.legend()
plt.figure(); plt.title("Sanity: P95 latency (ms)"); plt.plot(lat)
plt.show()


Train the PPO autoscaler
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

train_env = make_vec_env(lambda: KISimEnv(episode_len=240), n_envs=4)  # vectorized for speed
model = PPO("MlpPolicy", train_env, verbose=0,
            n_steps=512, batch_size=1024, learning_rate=3e-4,
            gamma=0.995, gae_lambda=0.95, ent_coef=0.01, tensorboard_log=None)
model.learn(total_timesteps=120_000)  # ~a few minutes on Colab CPU
model.save("ppo_kissim.zip")
Now that the model is trained, it is time to evaluate the PPO policy against a baseline.
def eval_policy(policy=None, episodes=8):
    results = []
    for ep in range(episodes):
        env = KISimEnv(episode_len=240, seed=ep*7 + 1)
        obs, _ = env.reset()
        p95s = []; thr = []; loads = []
        while True:
            if policy is None:
                # baseline steps with a no-op action; the HPA-ish heuristic
                # below adjusts replicas directly using the step's info
                act = np.array([2, 2, 0])  # ΔGPU=0, ΔCPU=0, pref=CPU-first
            else:
                act, _ = policy.predict(obs, deterministic=True)
            obs, r, done, trunc, info = env.step(act)
            # crude HPA-ish rule for the baseline, applied after we have info:
            if policy is None:
                # if p95 is high → scale CPU; if load is near CPU capacity → add a "GPU"
                if info["p95_ms"] > 150 and env.cpu_repl < env.max_cpu:
                    env.cpu_repl = min(env.max_cpu, env.cpu_repl + 1)
                elif info["load"] > (env.cpu_repl*env.cpu_cap*0.9) and env.gpu_repl < 1:
                    env.gpu_repl = min(env.max_gpu, env.gpu_repl + 1)
            p95s.append(info["p95_ms"]); thr.append(info["throughput"]); loads.append(info["load"])
            if done:
                break
        results.append((np.array(p95s), np.array(thr), np.array(loads)))
    return results

ppo_results = eval_policy(policy=model, episodes=8)
base_results = eval_policy(policy=None, episodes=8)

def summarize(results, name):
    p95_all = np.concatenate([r[0] for r in results])
    thr_all = np.concatenate([r[1] for r in results])
    ld_all = np.concatenate([r[2] for r in results])
    print(f"{name}: P95(ms) mean={p95_all.mean():.1f}, median={np.median(p95_all):.1f}, "
          f"95th={np.percentile(p95_all, 95):.1f}")
    print(f"{name}: Throughput mean={thr_all.mean():.1f} rps | Mean load={ld_all.mean():.1f} rps")

summarize(ppo_results, "PPO")
summarize(base_results, "Baseline")

As you can see, the PPO policy shows a significant improvement in both P95 latency and throughput over the baseline.
This paper is interesting and gives a good starting point for how AI can be used in our architectures to better utilize resources.
Happy Learning!!
