About Fundamental
Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.
At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.
Key responsibilities
Lead and mentor a team of DevOps engineers, fostering technical growth and collaboration
Define and drive the infrastructure roadmap aligned with company objectives
Architect and oversee cloud infrastructure design and implementation
Establish best practices, standards, and processes for infrastructure development and operations
Partner with Engineering, Research, and FDE to align infrastructure capabilities with business needs
Drive the evolution of Kubernetes clusters optimized for GPU workloads, production SaaS hosting, and varied enterprise deployment models
Champion GitOps practices using ArgoCD for continuous deployment
Establish infrastructure as code standards using Terraform
Define monitoring and observability strategy for distributed systems
Collaborate with ML engineers to optimize infrastructure for model training and serving
Own infrastructure reliability, performance, and security posture
Implement and maintain cost optimization strategies (FinOps) for cloud resources
Must have
7+ years of experience in cloud infrastructure and DevOps, with 3+ years in a technical leadership role
Proven track record of building and leading high-performing infrastructure teams
Strong experience with AWS, GCP, and Azure
Deep expertise in Kubernetes, including multi-cluster management, GPU workload optimization, resource scheduling and autoscaling, and network policies and security
Extensive experience with cloud networking, including VPC design, load balancer configuration, network security and segmentation, and cross-cloud networking solutions
Strong CI/CD expertise, preferably with GitHub Actions
Proficiency in Terraform
Proficiency with GitOps tools (ArgoCD preferred)
3+ years of experience with Python
Experience with monitoring and observability tools
Experience with FinOps practices and cloud cost optimization
Excellent communication skills, with the ability to translate technical concepts for diverse audiences
Nice to have
Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
Experience with FastAPI and backend applications
Familiarity with data platforms like Databricks or Snowflake
Experience with SRE practices, or cloud security certifications
Hands-on experience with Prometheus, Grafana, or Datadog
Experience scaling infrastructure for AI/ML startups
Benefits
Competitive compensation with salary and equity
Comprehensive health coverage, including medical, dental, and vision, plus a 401(k)
Fertility support, as well as paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action