About Nscale
Nscale is taking on the hyperscalers by building a vertically integrated GenAI cloud platform. We own the data centres, software, and applications that power today's AI applications using sustainable technology solutions. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. Collaboration is key, and we work together swiftly and respectfully, embracing adaptability and resilience in all we do.
About the Role
We're hiring a Director of Fleet Management to lead the team that keeps Nscale's bare-metal GPU fleet running. This is a hands-on role where you will be a core individual contributor as well as a leader of people and technology: you'll write production Python deployed using Helm and Kubernetes, design distributed systems, and steer architecture - while also hiring, mentoring, and driving delivery for a growing engineering team.
Fleet Manager automates the entire operational lifecycle of our compute infrastructure, from initial device enrolment through multi-day burn-in testing to ongoing health monitoring and automated remediation. The problems are challenging and the stakes are high: the software you design and build will determine Nscale’s success in scaling its GPU fleet to meet demand, and put you at the centre of some of the highest-impact work in the company.
What you'll work on
- Large-scale, business-critical automation that configures BMCs, manages DHCP reservations, drives bare-metal provisioning state machines, and runs GPU burn-in tests and remediation workflows.
- Complex workflow orchestration and event-driven state machines that span multiple days, survive crashes, resume from checkpoints, support human-in-the-loop approval gates, and let thousands of concurrent idempotent workflows operate without stepping on each other.
- Multi-site hub-and-spoke infrastructure tooling that works across geographically distributed data centres with independent trust boundaries.
- Integration with, and consistency across, data-centre inventory management (DCIM) tooling, bare-metal provisioning systems, credential stores, and monitoring infrastructure.
- Observability: structured logging, metrics, distributed tracing and tooling that lets operators troubleshoot effectively.
What you'll lead
- A team of highly talented software engineers building hardware lifecycle automation, leading from the front.
- The technical roadmap and architecture for how Nscale provisions, validates, monitors, and remediates hardware at massive scale.
- Writing code in critical areas of the codebase, shipping to production regularly, and setting the bar for execution: getting things done.
- Engineering standards: code review, testing, CI/CD, incident response, and on-call practices.
- Tight collaboration with Product, Infrastructure, Platform, SRE, and UI/UX to capture requirements early, align on interfaces, and ship integrations that meet operator needs.
- Hiring and developing engineers who thrive in a high-autonomy, high-accountability environment.
About You
- 10+ years building, owning, and operating complex distributed systems, with at least 2 years leading engineering teams.
- Hands-on experience with workflow orchestration (Temporal, Airflow, Prefect, or similar).
- Bare-metal expertise across compute, networking, and storage: BMC/IPMI/Redfish, PXE boot, DHCP, VLAN management, and provisioning systems like Ironic, MAAS, or equivalent.
- Confidence working at the intersection of software and physical infrastructure - debugging sometimes means asking "is the cable plugged in?"
- You've built systems that had to be fault-tolerant, resumable, and observable (so failures don’t turn into 3am pages).
- You stay effective while context-switching between deep work, judgement calls, and people leadership - writing a workflow activity, reviewing an ADR, and unblocking a team member in the same morning.
- You use modern AI tools as a force multiplier to speed up specs, scaffolding, tests, refactors, data exploration, incident triage, and docs.
Ways to stand out
- You've worked with OpenStack Ironic, NetBox, or similar data centre inventory and management platforms.
- You’ve used HPC workload schedulers like SLURM.
- You've designed multi-site architectures for infrastructure tooling.
- You've built hardware burn-in, validation, or remediation automation.
- You’ve owned results storage, analysis, and reporting for large-scale computational testing. Experience with HPC simulations or ML training is a plus.
What We Can Offer You
At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.
- Highly competitive package (base + equity) with reviews every 12 months. 🚀
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.
Equal Opportunities Statement
At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.
If there’s anything we can do to accommodate your specific situation, please let us know.
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.