Site Reliability Engineer, AI/ML Infrastructure
Boson AI (29 days ago)
About this role
The Senior Site Reliability Engineer will own the lifecycle of a large-scale GPU-focused HPC infrastructure, including planning, building, testing, and deploying systems in a Toronto datacenter. The role focuses on ensuring reliability and performance of GPU clusters, storage, and networking while supporting internal engineering and research teams. It also includes capacity planning and evaluation of new technologies as the environment scales.
Required Skills
- HPC Operations
- Cluster Management
- Infrastructure As Code
- Linux Administration
- Kubernetes
- Container Orchestration
- Ceph Storage
- Network Troubleshooting
- L2/L3 Networking
- Security Best Practices
Qualifications
- 5+ years SRE Experience
About Boson AI
boson.ai
Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Its platform offers high-fidelity, open-source voice synthesis and multi-speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson AI provides APIs, demos, and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.