OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI

4 months ago
San Francisco, CA
Onsite
Full Time
Senior
1 applicant
View Job Listing
OpenAI
Apply to 100+ jobs

About this role

A Site Reliability Engineer on the Frontier Systems Infrastructure team at OpenAI is responsible for operating and maintaining massive Kubernetes clusters that support advanced AI model training. This role combines distributed systems engineering with hands-on infrastructure work, focusing on automating cluster provisioning, enhancing system reliability, and building software abstractions to streamline operations across multiple data centers. The engineer will troubleshoot issues in high-pressure situations, manage bare-metal deployments, and optimize operational metrics to ensure efficient and reliable supercomputer performance.

Skills

OpenAI

About OpenAI

openai.com

OpenAI is an AI research and deployment company that develops advanced artificial intelligence models and products, including ChatGPT and APIs, to help people and organizations solve problems, create content, and build intelligent applications.

About OpenAI

Headquarters

San Francisco, CA

Company Size

201-500 employees

Founded

2018

Industry

Technology

Glassdoor Rating

4.2 / 5

Leadership Team

Sarah Johnson

Chief Executive Officer

Michael Chen

Chief Technology Officer

Emily Williams

VP of Engineering

David Rodriguez

VP of Product

Jessica Thompson

Chief Financial Officer

Andrew Park

VP of Sales

Unlock Company Insights

View leadership team, funding history,
and employee contacts for OpenAI.

Reveal Company Insights

ApplyBlast uses AI to match you with the right jobs, tailor your resume and cover letter, and apply automatically so you can land your dream job faster.

© All Rights Reserved. ApplyBlast.com