Senior System Architect, Infrastructure Reliability
NVIDIA (1 day ago)
About this role
NVIDIA is hiring a Senior System Architect specializing in Heterogeneous EDA Systems to develop an automated framework for failure attribution in high-performance computing environments. The role involves designing scalable diagnostic tools that analyze system telemetry to identify root causes of job failures across CPU, GPU, and system infrastructure.
Required Skills
- C++
- Python
- Distributed Systems
- Linux
- Cluster Management
- Telemetry
- GPU Monitoring
- System Diagnostics
- Machine Learning
- High-Performance Computing
About NVIDIA
nvidia.com
NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.
Similar Jobs
Senior Engineer, Failure Analysis, NTI
Eightfold (3 months ago)
Electrical Engineer II - Failure Analysis
RTX (2 months ago)
Senior Staff Failure Analysis Engineer-Electrical
Stryker (2 days ago)
Failure Analysis Engineer
Samsung Research America (17 days ago)
Part Time Avionics Failure Analyst
Homepage (7 days ago)
Pr. Electronic Materials Failure Analyst
RTX (10 days ago)