NVIDIA

Senior Software Engineer, AI Resiliency

NVIDIA (1 month ago)

Santa Clara, CA, United States | Onsite | Full Time | Senior | $184,000 - $2,875,000 | Software Engineering
About this role

A Senior Software Engineer on NVIDIA's AI Resiliency team will lead efforts to design and advance software resiliency features for extremely large AI supercomputers. The role focuses on ensuring reliability and minimizing downtime for clusters at the scale of 100,000+ GPUs, contributing to NVIDIA's AI infrastructure initiatives. This position sits within NVIDIA's engineering organization and works cross-functionally to improve the robustness of AI training and inference systems.

Required Skills

  • C++
  • Python
  • Distributed Systems
  • Parallel Programming
  • Fault Tolerance
  • PyTorch
  • JAX
  • Debugging Tools
  • Performance Tuning
  • CI/CD

Qualifications

  • BS in Computer Science or a related field
  • MS in Computer Science or a related field
  • PhD in Computer Science or a related field

About NVIDIA

nvidia.com

NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.
