Nebius

Senior Site Reliability Engineer — Token Factory (Inference Platform)

Nebius (7 months ago)

Hybrid | Full Time | Senior | $160,929 - $215,650 (estimated) | Site Reliability Engineering

About this role

As a Senior Site Reliability Engineer on Nebius Cloud's Token Factory, you will own the reliability, performance, and observability of the inference stack that powers large-scale multimodal AI models. The role focuses on scaling the platform to meet aggressive cost and reliability targets while keeping inference fast and dependable across a global GPU fleet. You will collaborate across engineering teams and drive automation, runbooks, and post‑mortems to maintain a self‑healing production environment.

Required Skills

  • Kubernetes
  • Prometheus
  • Grafana
  • Terraform
  • IaC
  • Python
  • Bash
  • Telemetry
  • Observability
  • SLOs

(5 more skills not shown in this listing)

About Nebius

nebius.com

Nebius is a cloud platform for AI explorers that provides GPU‑accelerated infrastructure to build, tune, and run machine learning models and applications. It offers access to top‑tier NVIDIA GPUs and tooling designed to maximize efficiency and performance for training, fine‑tuning, and inference. Nebius focuses on simplifying ML workflows so researchers, developers, and teams can iterate faster without managing hardware.
