Agentic AI for GPU Infrastructure

Detect. Analyse.
Alert. Remediate.

Full situational awareness for your GPU cluster.

GPUPilot monitors your entire Kubernetes GPU cluster — detects anomalies in real time, analyses root causes with AI, alerts your team, and remediates with one click.

Sign In · See how it works
RKE2 · OpenShift · GKE · EKS · AKS · VCF · Rancher · K3s · Run:ai · DCGM

The Closed Loop

01

Connect

One kubectl command. Read-only agent. Deploys in 30 seconds.

02

Detect

30+ DCGM metrics, pods, logs, events. Every 30 seconds. Nothing missed.

03

Analyse

Claude AI investigates every anomaly. Correlates events, finds root cause.

04

Alert

Instant Slack alert with diagnosis, severity, and fix. Before you check the dashboard.

05

Remediate

One-click approve or auto-fix. Every action logged and auditable.

Deploy in 30 seconds

kubectl apply -f https://gpupilot.io/api/install/YOUR_TOKEN
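Once the manifest is applied, you can confirm the agent is running and that its permissions really are read-only. A quick sketch, assuming the manifest creates a gpupilot namespace and a gpupilot-agent service account (the actual names come from the install manifest):

# Names below are assumptions; use whatever the install manifest creates
kubectl get pods -n gpupilot
# Requires impersonation rights; a read-only agent should list only get/list/watch verbs
kubectl auth can-i --list --as=system:serviceaccount:gpupilot:gpupilot-agent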

Real-Time Detection & Remediation

Every anomaly triggers an AI investigation, a Slack alert, and a suggested fix — automatically.

🔴

XID Error Detected

GPU fault on dgxb200:GPU3. AI: driver mismatch after update.

Approve Fix: kubectl drain dgxb200 --ignore-daemonsets
🔴

ECC Double-Bit Error

Uncorrectable fault on GPU 5. Row remaps: 3/4. Predicted failure in ~2 weeks.

Approve RMA: Pre-emptive replacement
🟠

GPU Utilization Drop

87% to 12% in 5 minutes. Scheduler stuck on node affinity.

Auto-Fixed: Affinity relaxed, jobs rescheduled

🟢

Node Recovered

worker-07 back online. NVSwitch firmware updated. 4 jobs resumed.

Resolved: Downtime 4m 12s
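Before approving a fix like the ECC card above, the diagnosis is easy to cross-check from the driver itself. A minimal sketch, run on the affected node (GPU index taken from the example alert):

# Check ECC counters and the row-remap table for GPU 5 on the affected node
nvidia-smi -q -i 5 -d ECC
nvidia-smi -q -i 5 | grep -A4 'Remapped Rows'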

What GPUPilot Monitors

GPU Utilization · VRAM Usage · Temperature · Power Draw · XID Errors · ECC (SBE/DBE) · Row Remaps · PCIe Replays · NVLink Bandwidth · SM/Mem Clocks · Run:ai Jobs · Pod Logs · K8s Events · ConfigMaps · Node Conditions · PVCs · Network Policies · Resource Quotas · Deployments · StatefulSets · DaemonSets
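The GPU-level signals in this list come from NVIDIA DCGM. If dcgm-exporter is already running in your cluster, you can inspect the same raw counters directly; the service and namespace names below are typical of the NVIDIA GPU Operator and may differ in your setup:

# Assumed service/namespace; dcgm-exporter listens on port 9400 by default
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP|POWER_USAGE|XID_ERRORS)'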

Predictive GPU Health

XID Errors: 0 (driver fault codes)
DBE (Agg): 0 (uncorrectable ECC)
Row Remaps: 0/4 (before RMA)
PCIe Replays: 12 (bus retransmits)
SBE (Volatile): 3 (corrected errors)
Temperature: 72°C (throttle at 85°C)
Power Cap: 850W (of 1000W limit)
Fleet Health: 7/8 GPUs healthy
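For a spot check on a single node, the same counters are available straight from nvidia-smi; a sketch using standard query fields:

# Per-GPU temperature, power, and ECC error counters on the local node
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv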

Ask Anything

You: Why is my training job stuck?

Checking cluster events...
Reading GPU metrics...
Analysing Run:ai state...

Job pslab-training is pending because GPU quota is exhausted. Project pslab has 8/8 GPUs allocated. The job is queued behind 2 higher-priority jobs.

Additionally, GPU 3 on dgxb200 has 2 uncorrectable row remaps; recommend a pre-emptive RMA before it fails mid-training.

$ kubectl get runaijobs -n runai-pslab
$ kubectl describe node dgxb200 | grep -A5 gpu
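The same checks are reproducible by hand if you want to confirm the diagnosis; a rough equivalent using the namespace and node names from the example above:

# Pending workloads in the project namespace, and GPU allocation on the node
kubectl get pods -n runai-pslab --field-selector=status.phase=Pending
kubectl describe node dgxb200 | grep -A8 'Allocated resources'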

Ready to close the loop?

One agent, read-only, deploys in 30 seconds. Works with any NVIDIA GPU cluster on Kubernetes.

Coming soon: disconnected (air-gap) environments

Sign In

Get notified about updates and new features