Senior HPC Operations Engineer

Lambda

4 months ago

3.4 on Glassdoor

San Francisco, CA

Hybrid

Full Time

Senior

About this role

The Senior HPC Operations Engineer at Lambda is responsible for the remote deployment and configuration of large-scale HPC clusters tailored for AI workloads. The role involves manual and automated installation of operating systems, software, and networking components, as well as troubleshooting issues in collaboration with on-site teams. The engineer will also mentor junior team members, maintain Standard Operating Procedures, and contribute to improvements in operational efficiency while keeping up with the latest HPC/AI technologies. The position requires extensive experience in HPC cluster management, strong technical skills in network fabrics, and familiarity with job scheduling and orchestration systems such as Slurm and Kubernetes.

Qualifications

Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

About Lambda

lambda.ai

Lambda is a cloud computing platform specializing in AI infrastructure, which it refers to as the "Superintelligence Cloud." It offers gigawatt-scale GPU cloud services with on-demand and reserved NVIDIA GPUs, designed specifically for AI training and inference. Lambda's solutions include private cloud options, one-click clusters for streamlined AI training, and orchestration tools for efficient workload management. The company serves AI teams that need scalable infrastructure to develop and deploy advanced machine learning applications.