Senior Site Reliability Engineer - Observability and Telemetry Platform
NVIDIA (1 month ago)
About this role
A Site Reliability Engineer at NVIDIA is responsible for ensuring high availability and efficient operation of large-scale GPU cloud services. The role focuses on designing and improving production systems and observability platforms to support performance, capacity, and developer velocity. It emphasizes automation, reliability engineering practices, and continuous system improvement in a collaborative engineering environment.
Required Skills
- Observability
- Telemetry
- Monitoring
- Logging
- Alerting
- Automation
- Capacity Management
- Linux
- Networking
- Containers
Qualifications
- BS in Computer Science or Related Field
About NVIDIA
nvidia.com
NVIDIA invented the GPU and drives advances in AI, HPC, gaming, creative design, autonomous vehicles, and robotics.