About Fundamental
Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.
At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.
Key responsibilities
Develop and manage scalable, automated machine learning pipelines, CI/CD workflows, and orchestration frameworks
Design and implement robust model serving infrastructure using platforms such as TorchServe, TensorFlow Serving, and Triton
Develop scalable inference architectures optimized for ultra-low latency and high throughput
Ensure seamless model deployment by implementing A/B testing, canary releases, and rollback capabilities
Develop logging, alerting, and monitoring solutions to track model development and reliability
Optimize GPU utilization, enable autoscaling, and streamline resource allocation to boost efficiency
Design, implement, and maintain feature stores, robust data pipelines, and scalable storage solutions to efficiently handle large volumes of data
Must have
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
5+ years of experience in MLOps or DevOps roles, working with MLOps platforms (e.g., MLflow, WandB) and frameworks (e.g., PyTorch, TensorFlow)
Experience building and designing MLOps infrastructure from the ground up
Experience with model serving frameworks (e.g., TorchServe, TensorFlow Serving, Triton, KServe) for highly scalable, low-latency inference
Experience in building and managing data pipelines to support both model training and inference
Experience with Kubernetes on a major cloud provider (AWS, GCP, or Azure) and with infrastructure as code (e.g. Terraform, Helm, GitOps)
Strong software engineering skills in Python, Bash, and Go, with a focus on writing clean, maintainable, and scalable code
Experience in AI/ML systems security, compliance, and model governance
Proficient with observability and monitoring tools, such as Prometheus, Grafana, Datadog, and OpenTelemetry
Nice to have
Experience with ML workflow tooling (MLflow, Kubeflow, or similar)
Experience with FastAPI and backend applications
Familiarity with data platforms like Databricks or Snowflake
Exposure to SRE practices or cloud security certifications
Benefits
Competitive compensation with salary and equity
Comprehensive benefits, including medical, dental, and vision coverage, plus a 401(k)
Fertility support, as well as paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action