ML Cluster Operations Engineer
TensorWave (2 months ago)
About this role
The ML Cluster Operations Engineer at TensorWave is responsible for managing and optimizing containerized Slurm and Kubernetes solutions for distributed machine learning workloads. This senior-level role requires extensive experience in HPC and cloud infrastructure, focusing on cluster health, uptime, automated node management, and performance profiling. The engineer will collaborate with the development team to implement CI automation, establish best practices for job execution at scale, and mentor other ML engineers. Key technologies include Slurm, Kubernetes, and distributed ML languages and frameworks such as Python and PyTorch.
Required Skills
- Slurm
- Kubernetes
- Cloud Infrastructure
- HPC
- Machine Learning
- CI Automation
- Health Checks
- Node Lifecycle
- Security
- Compliance
Qualifications
- 5+ years of experience in cloud infrastructure, HPC, or machine learning roles
- Significant hands-on experience with Slurm in production HPC/ML environments
- Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, and MPI
- Deep understanding of security, compliance, and resilience in containerized workloads
- 3+ years of hands-on Kubernetes experience
- Proficiency in writing Kubernetes manifests, Helm charts, and managing releases
About TensorWave
tensorwave.com
TensorWave is a cloud computing platform specializing in artificial intelligence (AI) and high-performance computing (HPC) services powered by AMD Instinct™ GPUs. The company provides a scalable and memory-optimized infrastructure designed to facilitate the deployment and management of demanding AI workloads, including low-latency inference and large language models. With a focus on efficiency and cost-effectiveness, TensorWave's offerings include bare-metal solutions and managed inference services, tailored to meet the needs of enterprises looking to harness the power of next-generation AI technologies.