poolside

Member of Engineering (Pre-training and inference fault tolerance)

poolside (2 months ago)

Remote | Full Time | Junior | $69,127 - $92,565 (estimated) | R&D

About this role

The Member of Engineering focused on Pre-training and Inference Fault Tolerance at Poolside plays a vital role in enhancing the reliability and fault tolerance of Large Language Models (LLMs) during distributed training. Responsibilities include troubleshooting hardware issues, minimizing GPU idle time, developing recovery tools, and improving checkpointing performance through high-quality coding in Python, C/C++, and CUDA. The position requires a strong background in engineering, a solid understanding of distributed systems and LLM fundamentals, and proficiency with frameworks like Torch. Candidates should possess strong algorithmic skills and be comfortable working within Linux environments.

Required Skills

  • Hardware Troubleshooting
  • GPU Optimization
  • Tool Development
  • Checkpoint Reliability
  • High-Quality Code
  • Large Language Models
  • Transformers Knowledge
  • Deep Learning Fundamentals
  • Strong Engineering Background
  • Programming Experience

About poolside

poolside.ai

Poolside is a foundation model company dedicated to infusing intelligence into the workplace, with the mission of driving abundance for humanity through the development of artificial general intelligence. By engaging in cutting-edge research, Poolside aims to transform frontier research into practical operational intelligence solutions. The company focuses on making advanced AI tools accessible across various domains of work.
