TensorWave

ML Cluster Operations Engineer

TensorWave (2 months ago)

Hybrid | Full Time | Senior | $160,328 - $215,211 (estimated) | Engineering

About this role

The ML Cluster Operations Engineer at TensorWave manages and optimizes containerized Slurm and Kubernetes solutions for distributed machine learning workloads. This senior-level role requires extensive experience in HPC and cloud infrastructure, with a focus on cluster health, uptime, automated node management, and performance profiling. The engineer will collaborate with the development team to implement CI automation, establish best practices for job execution at scale, and mentor other ML engineers. Key technologies include Slurm, Kubernetes, Python, and distributed ML frameworks such as PyTorch.

Required Skills

  • Slurm
  • Kubernetes
  • Cloud Infrastructure
  • HPC
  • Machine Learning
  • CI Automation
  • Health Checks
  • Node Lifecycle
  • Security
  • Compliance

Qualifications

  • 5+ years of experience in cloud infrastructure, HPC, or machine learning roles
  • Significant hands-on experience with Slurm in production HPC/ML environments
  • Strong knowledge of the languages, frameworks, and communication libraries used for distributed ML, such as Python, PyTorch, Megatron, c10d, and MPI
  • Deep understanding of security, compliance, and resilience in containerized workloads
  • 3+ years of hands-on Kubernetes experience
  • Proficiency in writing Kubernetes manifests, Helm charts, and managing releases

About TensorWave

tensorwave.com

TensorWave is a cloud computing platform specializing in artificial intelligence (AI) and high-performance computing (HPC) services powered by AMD Instinct™ GPUs. The company provides a scalable and memory-optimized infrastructure designed to facilitate the deployment and management of demanding AI workloads, including low-latency inference and large language models. With a focus on efficiency and cost-effectiveness, TensorWave's offerings include bare-metal solutions and managed inference services, tailored to meet the needs of enterprises looking to harness the power of next-generation AI technologies.
