Site Reliability Engineer, AI/ML Infrastructure
Boson AI (1 month ago)
About this role
The Senior Site Reliability Engineer will own the full lifecycle of a large-scale GPU-accelerated HPC infrastructure in a Toronto datacenter, including NVIDIA H100/A100 clusters and multi-petabyte Ceph storage. The role focuses on planning, building, testing, deploying, and operating critical systems to ensure high performance and reliability. The engineer will collaborate closely with engineering and research teams to support their workloads and plan for future capacity and technology evolution. This position is ideal for someone who enjoys complex infrastructure challenges and continuous learning.
Required Skills
- HPC Operations
- Infrastructure As Code
- Cluster Management
- Linux Administration
- Kubernetes
- Container Orchestration
- Ceph Storage
- Network Fundamentals
- Security Best Practices
- Python
About Boson AI
boson.ai
Boson AI builds conversational and audio-generation AI focused on making interaction with machines "as easy, natural and fun as talking to a human." Their platform offers high-fidelity, open-source voice synthesis and multi-speaker dialog generation, plus promptable audio (including sound effects) and emotional voice rendering. Boson provides APIs, demos and developer tools so teams can embed natural spoken interfaces into products. The company targets developers and businesses creating conversational experiences across products and platforms.