Senior Software Engineer, AI Resiliency
NVIDIA (1 month ago)
About this role
A Senior Software Engineer on NVIDIA's AI Resiliency team will lead efforts to design and advance software resiliency features for extremely large AI supercomputers. The role focuses on ensuring reliability and minimizing downtime for clusters at the scale of 100,000+ GPUs, contributing to NVIDIA's AI infrastructure initiatives. This position sits within NVIDIA's engineering organization and works cross-functionally to improve the robustness of AI training and inference systems.
Required Skills
- C++
- Python
- Distributed Systems
- Parallel Programming
- Fault Tolerance
- PyTorch
- JAX
- Debugging Tools
- Performance Tuning
- CI/CD
Qualifications
- BS in Computer Science or related
- MS in Computer Science or related
- PhD in Computer Science or related
About NVIDIA
nvidia.com
NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.