Site Reliability Engineer, Frontier Systems Infrastructure
OpenAI(3 months ago)
About this role
A Site Reliability Engineer on the Frontier Systems Infrastructure team at OpenAI is responsible for operating and maintaining massive Kubernetes clusters that support advanced AI model training. This role combines distributed systems engineering with hands-on infrastructure work, focusing on automating cluster provisioning, enhancing system reliability, and building software abstractions to streamline operations across multiple data centers. The engineer will troubleshoot issues in high-pressure situations, manage bare-metal deployments, and optimize operational metrics to ensure efficient and reliable supercomputer performance.
Required Skills
- Kubernetes
- Distributed Systems
- Infrastructure
- Automation
- Node Management
- Monitoring
- Observability
- Programming
- Scripting
- Linux
+6 more
About OpenAI
openai.comThe company specializes in providing innovative solutions to enhance online experiences for businesses and consumers. It offers a range of products and services designed to streamline operations, improve customer engagement, and drive growth in the digital landscape. With a focus on user experience and cutting-edge technology, the company aims to empower businesses to thrive in a rapidly evolving market.
Apply instantly with AI
Let ApplyBlast auto-apply to jobs like this for you. Save hours on applications and land your dream job faster.
More jobs at OpenAI
Similar Jobs
Senior Site Reliability Engineer
BetterUp(1 month ago)
Engineering Manager, Systems
Anthropic(1 month ago)
Manager, Bare Metal Support Engineering
CoreWeave(6 days ago)
Senior Platform Engineer
Pluralis Research(1 month ago)
Senior Platform/DevOps Engineer (Kubernetes-Linux-Azure Local)
Armada(20 days ago)
DevOps Engineer
BJAK(26 days ago)