Site Reliability Engineer, AI/ML Infrastructure

Boson AI (1 month ago)

Santa Clara, CA | Onsite | Full Time | Senior | $160,833 - $217,533 (estimated) | Site Reliability Engineering

About this role

The Senior Site Reliability Engineer will own the reliability and performance of a large-scale GPU-powered HPC cluster in a Toronto datacenter. The role spans the full lifecycle of infrastructure, from planning and deployment to operations and scaling. The engineer will collaborate with engineering and research teams to ensure efficient cluster usage and support growth. They will also participate in capacity planning and evaluation of new technologies as the environment scales.

Required Skills

HPC Operations
Cluster Management
Infrastructure As Code
Linux Administration
Kubernetes
Container Orchestration
Ceph Storage
Performance Monitoring
Troubleshooting
Automation Development

Qualifications

5+ years SRE experience
5+ years HPC operations experience

About Boson AI

boson.ai

Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Their platform offers high‑fidelity, open‑source voice synthesis and multi‑speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson provides APIs, demos and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.
