Staff Site Reliability Engineer, Compute

Crusoe

3 months ago

3.7 on Glassdoor

San Francisco, CA

Onsite

Full Time

Senior

About this role

The Staff Site Reliability Engineer, Compute at Crusoe is responsible for deploying and optimizing sustainable, AI-first cloud infrastructure, with a focus on virtualization, hypervisor, and kernel-level performance. The role involves developing automation and observability tools, managing the virtualization stack using technologies like KVM and QEMU, and collaborating with hardware teams to troubleshoot performance bottlenecks and optimize workloads for modern AI and HPC demands. The engineer also handles kernel tuning and system-level debugging, and integrates enhancements for guest VM reliability.

Qualifications

8+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles

Strong proficiency in Linux kernel internals

Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware

Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3)

Expert-level skills in at least one programming language: Go, C or Rust

About Crusoe

crusoe.ai

Crusoe is a leading provider of next-generation AI infrastructure that focuses on renewable-powered cloud computing solutions. By employing an energy-first approach, Crusoe enables businesses to deploy AI workloads at scale while ensuring reliable performance and round-the-clock support. The company is committed to advancing sustainable technology, making it a strategic partner for organizations looking to enhance their AI capabilities in an environmentally conscious manner.