Site Reliability Engineer, AI/ML Infrastructure
Boson AI (1 month ago)
About this role
The Senior Site Reliability Engineer will operate and scale a large GPU-based HPC cluster in a Toronto datacenter, supporting cutting-edge machine learning and research workloads. The role covers the full lifecycle of HPC infrastructure, from planning and deployment to ongoing reliability and performance. It involves close collaboration with engineering and science teams to ensure infrastructure meets their needs and can scale with future demand. The position is ideal for someone who enjoys complex technical environments and continuous learning.
Required Skills
- HPC Operations
- Cluster Management
- Infrastructure As Code
- Linux Administration
- Kubernetes
- Container Orchestration
- Ceph Storage
- Performance Monitoring
- Troubleshooting
- Automation Development
Qualifications
- 5+ Years SRE Experience
- 5+ Years HPC Operations Experience
About Boson AI
boson.ai
Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Their platform offers high‑fidelity, open‑source voice synthesis and multi‑speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson provides APIs, demos and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.