Senior HPC Operations Engineer
Lambda (2 months ago)
About this role
The Senior HPC Operations Engineer at Lambda remotely deploys and configures large-scale HPC clusters built for AI workloads. The role covers manual and automated installation of operating systems, software, and networking components, as well as troubleshooting in collaboration with on-site teams. The engineer also mentors junior team members, maintains Standard Operating Procedures, and contributes to operational efficiency improvements while keeping current with HPC/AI technologies. The position requires extensive experience in HPC cluster management, strong technical skills in network fabrics, and familiarity with workload scheduling and orchestration systems such as SLURM and Kubernetes.
Required Skills
- HPC Clusters
- Remote Deployment
- Operating Systems
- Firmware
- Software
- Networking
- Troubleshooting
- Standard Operating Procedures
- Mentoring
- AI Technologies
Qualifications
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience
About Lambda
lambda.ai
Lambda is a cloud computing platform specializing in AI infrastructure, referred to as the "Superintelligence Cloud." It offers gigawatt-scale AI GPU cloud services with on-demand and reserved NVIDIA GPUs, designed specifically for AI training and inference. Lambda's solutions include private cloud options, one-click clusters for streamlined AI training, and orchestration tools for efficient workload management. The company serves AI teams that need scalable infrastructure to develop and deploy advanced machine learning applications.
Similar Jobs
Senior HPC Cluster Engineer
Nebius (1 year ago)
Senior HPC Developer - GPU and Networking
Clockwork.io (7 days ago)
Network Engineer, AI/ML Infrastructure
Boson AI (1 month ago)
Senior Systems Engineer - AI Infrastructure
Clockwork.io (6 days ago)
Senior HPC Cluster Engineer
Nebius (10 months ago)