Member of Engineering (Pre-training and inference fault tolerance)
poolside (2 months ago)
About this role
The Member of Engineering focused on Pre-training and Inference Fault Tolerance at Poolside plays a vital role in enhancing the reliability and fault tolerance of Large Language Models (LLMs) during distributed training. Responsibilities include troubleshooting hardware issues, minimizing GPU idle time, developing recovery tools, and improving checkpointing performance through high-quality coding in Python, C/C++, and CUDA. The position requires a strong background in engineering, a solid understanding of distributed systems and LLM fundamentals, and proficiency with frameworks like Torch. Candidates should possess strong algorithmic skills and be comfortable working within Linux environments.
Required Skills
- Hardware Troubleshooting
- GPU Optimization
- Tool Development
- Checkpoint Reliability
- High-Quality Code
- Large Language Models
- Transformers Knowledge
- Deep Learning Fundamentals
- Strong Engineering Background
- Programming Experience
About poolside
poolside.ai
Poolside is a foundation model company dedicated to infusing intelligence into the workplace, with the mission of driving abundance for humanity through the development of artificial general intelligence. By engaging in cutting-edge research, Poolside aims to transform frontier research into practical operational intelligence solutions. The company focuses on making advanced AI tools accessible across various domains of work.