Senior Product Manager - Observability and Resilience
NVIDIA (1 month ago)
About this role
NVIDIA is seeking a Senior Product Manager to lead development of foundational resiliency and observability tools for large-scale accelerated computing platforms. The role focuses on system diagnostics, performance monitoring, and automated recovery to maximize uptime and efficiency for AI training and inference workloads. The position supports deployment and operation of AI infrastructure across customers and partners.
Required Skills
- Resiliency
- Observability
- GPU Observability
- Telemetry
- Reliability
- Kubernetes
- Containerization
- Cloud
- Networking
- HPC
Qualifications
- BS in Computer Science or a related field
- MS in Computer Science or a related field
About NVIDIA
nvidia.com
NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.
Similar Jobs
Observability Engineer
LSEG (1 month ago)
Enterprise Observability Architect
MSD (17 days ago)
Software Engineer - Observability
Wolt - English (11 months ago)
Observability Engineer
TensorWave (1 month ago)
Software Engineer, Observability
Airtable (20 days ago)
Software Engineer, Observability
Airtable (1 month ago)