The Role:
As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.
Key Responsibilities:
- Design and implement scalable infrastructure for large-scale model training workloads
- Develop and maintain distributed training pipelines for LLMs and multimodal models
- Optimize training performance across multiple GPUs, nodes, and data centers
- Implement monitoring, logging, and debugging tools for training operations
- Architect and maintain data storage solutions for large-scale training datasets
- Automate infrastructure provisioning, scaling, and orchestration for model training
- Collaborate with researchers to implement and optimize training methodologies
- Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
- Troubleshoot complex performance issues in distributed training environments
Minimum Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, or a related field, or equivalent practical experience
- 3+ years of experience with distributed systems and ML infrastructure
- Experience with PyTorch
- Proficiency with cloud platforms (AWS, GCP, or Azure)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
Preferred Qualifications:
- Master's or PhD in Computer Science or a related field
- Experience training large language models or multimodal AI systems
- Experience with ML workflow orchestration tools
- Background in optimizing high-performance distributed computing systems
- Familiarity with ML DevOps practices
- Contributions to open-source ML infrastructure or related projects