Job Duties: Design, develop, and maintain large-scale backend and cloud-native infrastructure to
support distributed machine learning training, inference, and data processing pipelines for a generative AI platform.
Architect and build scalable, resilient backend infrastructure to support distributed training, inference, and data
processing pipelines. Lead technical design discussions, mentor engineers, and establish best practices for
large-scale machine learning systems. Design and implement core backend services with a focus on efficiency
and low latency. Drive infrastructure optimization initiatives for compute cost, storage lifecycle management, and
network performance. Collaborate with machine learning, DevOps, and product teams to translate research and
product requirements into robust infrastructure solutions. Evaluate and integrate cloud-native and open-source
technologies such as Kubernetes, Ray, Kubeflow, and MLflow to enhance platform reliability. Own end-to-end
systems from design to deployment, emphasizing reliability, fault tolerance, and operational excellence.
Minimum Education & Experience Required: Bachelor’s degree or equivalent in Computer Science or related
field plus four (4) years of experience in software engineering or related role.
Minimum Skills Required: 4 years of experience designing, building, and optimizing large-scale backend
infrastructure and distributed data systems (e.g., PostgreSQL, MySQL, DynamoDB, Apache Spark, Apache
Flink, Apache Kafka) in cloud environments (AWS, GCP, Azure, or equivalent), including cloud-native platforms,
core infrastructure components, and optimization techniques (caching, indexing, sharding, replication,
transactions, ACID). 4 years of experience with major server-side programming languages and frameworks
(e.g., Python, C++, Go, TypeScript). 4 years of experience writing technical design documentation, leading cross-
functional projects, and collaborating with cross-functional teams to achieve business impact. 3 years of
experience developing and maintaining data processing and API systems, including client-server communication
frameworks (e.g., gRPC, Thrift). 3 years of experience conducting A/B testing and scientific experimentation
(e.g., Statsig, Meta Deltoid, Optimizely) to measure software impact. 3 years of experience conducting coding
interviews and providing systematic feedback for engineering candidates. 2 years of experience with cloud-native
tools and infrastructure, such as Docker and Kubernetes. 2 years of experience defining and implementing data-
driven metrics to support company or team goals.
How to Apply: Submit resume and apply online at http://www.fireworks.ai/careers and search for the job by title.