NVIDIA

Senior Software Engineer, AI Resiliency

NVIDIA (20 days ago)

Hybrid · Full Time · Senior · $184,000 - $287,500 · Engineering

About this role

A Senior Software Engineer on NVIDIA's AI Resiliency team will help define and advance software resiliency for large-scale AI supercomputers (100,000+ GPUs). The role focuses on ensuring system robustness and minimizing cluster downtime for AI training and inference infrastructure. You'll work with cross-functional teams to scale and validate resilient AI systems.

Required Skills

  • C++
  • Python
  • Distributed Systems
  • Parallel Programming
  • Fault Tolerance
  • Checkpointing
  • Debugging Tools
  • Performance Tuning
  • CUDA
  • NCCL

Qualifications

  • Bachelor’s in Computer Science, Electrical Engineering, or a related field
  • Master’s in Computer Science, Electrical Engineering, or a related field
  • PhD in Computer Science, Electrical Engineering, or a related field

About NVIDIA

nvidia.com

NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.
