Running machine learning (ML) algorithms at our scale often requires solving novel systems problems. As a Performance Engineer, you'll be responsible for identifying these problems and then developing solutions that optimize the throughput and robustness of our largest distributed systems. Strong candidates will have a track record of solving large-scale systems problems and will be excited to grow into ML expertise as well.
You may be a good fit if you:
- Have significant software engineering or machine learning experience, particularly at supercomputing scale
- Are results-oriented, with a bias towards flexibility and impact
- Pick up slack, even if it goes outside your job description
- Enjoy pair programming (we love to pair!)
- Want to learn more about machine learning research
- Care about the societal impacts of your work
Strong candidates may also have experience with:
- High performance, large-scale ML systems
- GPU/Accelerator programming
- ML framework internals
- OS internals
- Language modeling with transformers
Representative projects:
- Implement low-latency high-throughput sampling for large language models
- Implement GPU kernels to adapt our models to low-precision inference
- Write a custom load-balancing algorithm to optimize serving efficiency
- Build quantitative models of system performance
- Design and implement a fault-tolerant distributed system running with a complex network topology
- Debug kernel-level network latency spikes in a containerized environment
Deadline to apply: None. Applications will be reviewed on a rolling basis.