Agentic AI for GPU Infrastructure

Detect. Analyse.
Alert. Remediate.

Full situational awareness for your GPU cluster.

GPUPilot monitors your entire Kubernetes GPU cluster — detects anomalies in real time, analyses root causes with AI, alerts your team, and remediates with one click.

Sign In · See how it works
RKE2 · OpenShift · GKE · EKS · AKS · VCF · Rancher · K3s · Run:ai · DCGM

The Closed Loop

01

Connect

One kubectl command. Read-only agent. Deploys in 30 seconds.

02

Detect

30+ DCGM metrics, pods, logs, events. Every 30 seconds. Nothing missed.

03

Analyse

Claude AI investigates every anomaly. Correlates events, finds root cause.

04

Alert

Instant Slack alert with diagnosis, severity, and fix. Before you check the dashboard.

05

Remediate

One-click approve or auto-fix. Every action logged and auditable.

Deploy in 30 seconds

kubectl apply -f https://gpupilot.io/api/install/YOUR_TOKEN
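Once the manifest is applied, you can confirm the agent is running and that its permissions really are read-only. A quick sketch, assuming the manifest creates a gpupilot namespace and a gpupilot-agent service account (the actual names come from the install manifest):

# Names below are assumptions; use whatever the install manifest creates
kubectl get pods -n gpupilot
# Requires impersonation rights; a read-only agent should list only get/list/watch verbs
kubectl auth can-i --list --as=system:serviceaccount:gpupilot:gpupilot-agent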

Real-Time Detection & Remediation

Every anomaly triggers an AI investigation, a Slack alert, and a suggested fix — automatically.

🔴

XID Error Detected

GPU fault on dgxb200:GPU3. AI: driver mismatch after update.

Approve Fix: kubectl drain dgxb200 --ignore-daemonsets
🔴

ECC Double-Bit Error

Uncorrectable fault on GPU 5. Row remaps: 3/4. Predicted failure in ~2 weeks.

Approve RMA: Pre-emptive replacement
🟠

GPU Utilization Drop

87% to 12% in 5 minutes. Scheduler stuck on node affinity.

Auto-Fixed: Affinity relaxed, jobs rescheduled

🟢

Node Recovered

worker-07 back online. NVSwitch firmware updated. 4 jobs resumed.

Resolved: Downtime 4m 12s
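Before approving a fix like the ECC card above, the diagnosis is easy to cross-check from the driver itself. A minimal sketch, run on the affected node (GPU index taken from the example alert):

# Check ECC counters and the row-remap table for GPU 5 on the affected node
nvidia-smi -q -i 5 -d ECC
nvidia-smi -q -i 5 | grep -A4 'Remapped Rows'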

What GPUPilot Monitors

GPU Utilization · VRAM Usage · Temperature · Power Draw · XID Errors · ECC (SBE/DBE) · Row Remaps · PCIe Replays · NVLink Bandwidth · SM/Mem Clocks · Run:ai Jobs · Pod Logs · K8s Events · ConfigMaps · Node Conditions · PVCs · Network Policies · Resource Quotas · Deployments · StatefulSets · DaemonSets
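The GPU-level signals in this list come from NVIDIA DCGM. If dcgm-exporter is already running in your cluster, you can inspect the same raw counters directly; the service and namespace names below are typical of the NVIDIA GPU Operator and may differ in your setup:

# Assumed service/namespace; dcgm-exporter listens on port 9400 by default
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP|POWER_USAGE|XID_ERRORS)'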

Predictive GPU Health

XID Errors: 0 (driver fault codes)
DBE (Agg): 0 (uncorrectable ECC)
Row Remaps: 0/4 (before RMA)
PCIe Replays: 12 (bus retransmits)
SBE (Volatile): 3 (corrected errors)
Temperature: 72°C (throttle at 85°C)
Power Cap: 850W (of 1000W limit)
Fleet Health: 7/8 GPUs healthy
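For a spot check on a single node, the same counters are available straight from nvidia-smi; a sketch using standard query fields:

# Per-GPU temperature, power, and ECC error counters on the local node
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.aggregate.total --format=csv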

Ask Anything

You: Why is my training job stuck?

Checking cluster events...
Reading GPU metrics...
Analysing Run:ai state...

Job pslab-training is pending because GPU quota is exhausted. Project pslab has 8/8 GPUs allocated. The job is queued behind 2 higher-priority jobs.

Additionally, GPU 3 on dgxb200 has 2 uncorrectable row remaps; recommend a pre-emptive RMA before it fails mid-training.

$ kubectl get runaijobs -n runai-pslab
$ kubectl describe node dgxb200 | grep -A5 gpu
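The same checks are reproducible by hand if you want to confirm the diagnosis; a rough equivalent using the namespace and node names from the example above:

# Pending workloads in the project namespace, and GPU allocation on the node
kubectl get pods -n runai-pslab --field-selector=status.phase=Pending
kubectl describe node dgxb200 | grep -A8 'Allocated resources'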

Ready to close the loop?

One agent, read-only, deploys in 30 seconds. Works with any NVIDIA GPU cluster on Kubernetes.

Coming soon: disconnected (air-gap) environments

Sign In

Get notified about updates and new features