Site Reliability Engineer, Frontier Systems Infrastructure
OpenAI
About this role
A Site Reliability Engineer on the Frontier Systems Infrastructure team at OpenAI is responsible for operating and maintaining massive Kubernetes clusters that support advanced AI model training. This role combines distributed systems engineering with hands-on infrastructure work, focusing on automating cluster provisioning, enhancing system reliability, and building software abstractions to streamline operations across multiple data centers. The engineer will troubleshoot issues in high-pressure situations, manage bare-metal deployments, and optimize operational metrics to ensure efficient and reliable supercomputer performance.
Skills
About OpenAI
openai.comOpenAI is an AI research and deployment company that develops advanced artificial intelligence models and products, including ChatGPT and APIs, to help people and organizations solve problems, create content, and build intelligent applications.
Recent company news
OpenAI to acquire Promptfoo
2 days ago
Google And OpenAI Staff Unite Behind Anthropic As It Sues Pentagon
1 day ago
Workers at OpenAI show support for Anthropic as the company says it could lose $5 billion in its feud with the Pentagon
2 days ago
OpenAI is feeling the heat, but it’s not cooked
3 days ago
Metadata company Gracenote is the latest to sue OpenAI for copyright infringement
1 day ago
About OpenAI
Headquarters
San Francisco, CA
Company Size
201-500 employees
Founded
2018
Industry
Technology
Glassdoor Rating
4.2 / 5
Leadership Team
Sarah Johnson
Chief Executive Officer
Michael Chen
Chief Technology Officer
Emily Williams
VP of Engineering
David Rodriguez
VP of Product
Jessica Thompson
Chief Financial Officer
Andrew Park
VP of Sales
Unlock Company Insights
View leadership team, funding history,
and employee contacts for OpenAI.
Salary
$255k – $490k
per year