Senior Site Reliability Engineer — Token Factory (Inference Platform)
Nebius (7 months ago)
About this role
A Reliability Engineer on Nebius Cloud's Token Factory will own the reliability, performance, and observability of the inference stack powering large-scale multimodal AI models. The role focuses on scaling the platform to meet aggressive cost and reliability targets while ensuring fast, reliable inference across a global GPU fleet. You'll collaborate across engineering teams and drive automation, runbooks, and post‑mortems to maintain a self‑healing production environment.
Required Skills
- Kubernetes
- Prometheus
- Grafana
- Terraform
- IaC
- Python
- Bash
- Telemetry
- Observability
- SLOs
About Nebius
nebius.com
Nebius is a cloud platform for AI explorers that provides GPU‑accelerated infrastructure to build, tune, and run machine learning models and applications. It offers access to top‑tier NVIDIA GPUs and tooling designed to maximize efficiency and performance for training, fine‑tuning, and inference. Nebius focuses on simplifying ML workflows so researchers, developers, and teams can iterate faster without managing hardware.