Staff Software Engineer, GPU Infrastructure (HPC)
Cohere (2 months ago)
About this role
As a Staff Software Engineer in GPU Infrastructure (HPC) at Cohere, you will design and operate ML-optimized, Kubernetes-based GPU/TPU superclusters across multiple clouds, focusing on stability, scalability, and performance for AI workloads. Your responsibilities include optimizing infrastructure with cloud providers, troubleshooting complex issues, and enabling researchers with self-service tools to manage their AI training jobs. You will collaborate closely with AI researchers to innovate and implement solutions that enhance machine learning infrastructure while championing best practices like automation and observability.
Required Skills
- ML Infrastructure
- Kubernetes
- HPC Infrastructure
- GPU Clusters
- Distributed Training
- Python
- Go
- Linux
- RDMA Networking
- Performance Optimization
About Cohere
cohere.com
Similar Jobs
Senior HPC Cluster Engineer
Nebius (1 year ago)
Senior HPC Developer - GPU and Networking
Clockwork.io (7 days ago)
Senior HPC Cluster Engineer
Nebius (10 months ago)
Senior HPC Operations Engineer
Lambda (2 months ago)
Software Engineer, Infrastructure
Exa (1 month ago)
Manager, HPC Storage Engineer
Runpod, Inc. (6 days ago)