Site Reliability Engineer, AI/ML Infrastructure
Boson AI (1 month ago)
About this role
This Senior Site Reliability Engineer role focuses on operating and scaling a large GPU-based HPC datacenter in Toronto. The position spans the full lifecycle of high-performance infrastructure, from planning and deployment to ongoing reliability and performance optimization. The engineer will partner closely with ML and research teams to ensure the cluster meets evolving compute and storage needs, and will evaluate new technologies as the environment grows.
Required Skills
- HPC Operations
- Infrastructure as Code
- Cluster Optimization
- Ceph Storage
- Automation Tooling
- Linux Administration
- Kubernetes
- Container Orchestration
- Multi-Tenant Security
- L2/L3 Networking
+12 more
Qualifications
- 5+ years of SRE or HPC operations experience
About Boson AI
boson.ai
Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Their platform offers high‑fidelity, open‑source voice synthesis and multi‑speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson provides APIs, demos and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.