Site Reliability Engineer, AI/ML Infrastructure
Boson AI (8 days ago)
About this role
Senior Site Reliability Engineer role based in Toronto responsible for operating and evolving a large GPU-focused HPC datacenter featuring NVIDIA H100/A100 GPUs, Ceph storage, and high-speed networking. The position supports ML and research teams by ensuring reliable access to compute resources, collaborating across engineering and science groups, and planning capacity and technology roadmaps as the cluster scales. The role emphasizes hands-on ownership of the infrastructure lifecycle and continuous improvement.
Required Skills
- Linux
- Kubernetes
- Ceph
- Python
- Bash
- Ansible
- Terraform
- GitOps
- Helm
- ArgoCD
About Boson AI
boson.ai
Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Their platform offers high‑fidelity, open‑source voice synthesis and multi‑speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson provides APIs, demos and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.
Similar Jobs
Senior HPC Developer - GPU and Networking
Clockwork.io (7 days ago)
Senior Systems Engineer - AI Infrastructure
Clockwork.io (6 days ago)
Senior Datacenter Systems Architect
Sustainable Talent (2 months ago)
Senior HPC Cluster Engineer
Nebius (10 months ago)
Senior HPC Cluster Engineer
Nebius (1 year ago)