OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI(3 months ago)

San Francisco, CAOnsiteFull TimeSenior$255,000 - $490,000Scaling
Apply Now

About this role

A Site Reliability Engineer on the Frontier Systems Infrastructure team at OpenAI is responsible for operating and maintaining massive Kubernetes clusters that support advanced AI model training. This role combines distributed systems engineering with hands-on infrastructure work, focusing on automating cluster provisioning, enhancing system reliability, and building software abstractions to streamline operations across multiple data centers. The engineer will troubleshoot issues in high-pressure situations, manage bare-metal deployments, and optimize operational metrics to ensure efficient and reliable supercomputer performance.

View Original Listing

Required Skills

  • Kubernetes
  • Distributed Systems
  • Infrastructure
  • Automation
  • Node Management
  • Monitoring
  • Observability
  • Programming
  • Scripting
  • Linux

+6 more

OpenAI

About OpenAI

openai.com

The company specializes in providing innovative solutions to enhance online experiences for businesses and consumers. It offers a range of products and services designed to streamline operations, improve customer engagement, and drive growth in the digital landscape. With a focus on user experience and cutting-edge technology, the company aims to empower businesses to thrive in a rapidly evolving market.

ApplyBlast uses AI to match you with the right jobs, tailor your resume and cover letter, and apply automatically so you can land your dream job faster.

© All Rights Reserved. ApplyBlast.com