Principal Cluster Engineer, Training Infrastructure

verda

23 days ago

Remote

Full Time

Senior

About this role

Verda is hiring a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure, building and operating large-scale AI and HPC clusters for next-generation machine learning workloads. You will collaborate with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure the GPU infrastructure is fast, reliable, and scalable. The role emphasizes architecture, automation, and observability, along with defining the technical standards that translate customer and product requirements into robust infrastructure capabilities.

About Verda

verda.com

Verda (formerly DataCrunch) is a European AI cloud provider that offers on-demand GPU instances, autoscaling clusters, managed endpoints, and serverless inference for hosting and deploying models in production. It supplies self-service instances and clusters powered by the latest NVIDIA hardware (B200, H200, H100, A100, L40S, RTX series), with tooling to start, stop, or hibernate them via dashboard or API for cost-efficient, high-performance deployments. Verda is ISO 27001-certified, GDPR-compliant, runs on 100% renewable energy, and provides engineer support via in-dashboard chat. Backed by in-house AI R&D, it targets AI teams seeking secure, EU-based GPU infrastructure and managed inference.