Senior Software Engineer, AI Resiliency
NVIDIA (20 days ago)
About this role
A Senior Software Engineer on NVIDIA's AI Resiliency team will help define and advance software resiliency for large-scale AI supercomputers (100,000+ GPUs). The role focuses on ensuring system robustness and minimizing cluster downtime for AI training and inference infrastructure. You'll work with cross-functional teams to scale and validate resilient AI systems.
Required Skills
- C++
- Python
- Distributed Systems
- Parallel Programming
- Fault Tolerance
- Checkpointing
- Debugging Tools
- Performance Tuning
- CUDA
- NCCL
Qualifications
- Bachelor’s in Computer Science, Electrical Engineering, or a related field
- Master’s in Computer Science, Electrical Engineering, or a related field
- PhD in Computer Science, Electrical Engineering, or a related field
About NVIDIA
nvidia.com
NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.