Director of Engineering, Cluster Networking

Nscale • US • 1w ago

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About The Role

We are seeking a Director of Engineering for Cluster Networking to lead the architecture, design, and engineering delivery of our global cluster networking infrastructure. This is a deeply technical leadership role responsible for building and operating high-performance, large-scale networking environments that underpin our GPU cloud platform and customer-facing AI workloads.

You will own the technical strategy and execution of cluster networking, including high-speed Ethernet and InfiniBand fabric design, data centre networking, and WAN interconnectivity. While this role includes elements of program and cross-functional coordination, it is fundamentally an engineering leadership position—responsible for architectural integrity, technical standards, operational excellence, and building a world-class networking engineering organization.

This is a high-impact role for a hands-on technical leader who thrives in fast-paced, ambiguous environments and is passionate about building resilient, high-performance infrastructure at global scale.

What You'll Be Doing

Technical Strategy & Architecture Leadership

Define and evolve the multi-year technical roadmap for cluster networking, aligning architecture with Nscale's AI platform requirements and growth strategy.
Own the design and evolution of high-performance networking fabrics (Ethernet, InfiniBand) for GPU clusters and AI workloads.
Establish and maintain reference architectures and engineering standards across all regions and data centres.
Make key architectural decisions on topology (e.g., Fat Tree, Rail), routing, congestion control, resiliency, and scalability.
Evaluate emerging technologies, vendors, and design approaches to ensure Nscale remains at the forefront of HPC and cloud networking innovation.

Engineering Execution & Delivery

Lead end-to-end engineering delivery of cluster networking solutions, from design and lab validation to production deployment and optimization.
Partner with deployment, hardware, and data centre teams to ensure accurate BOMs, scalable designs, and flawless implementation.
Oversee capacity planning, performance modeling, and scaling strategies to meet rapid customer demand.
Ensure network changes and expansions are executed with minimal risk and zero customer impact wherever possible.
Introduce structured engineering lifecycle processes, including design reviews, validation gates, and post-incident analysis.

Operational Excellence & Reliability

Own the operational performance, availability, and reliability of cluster networking infrastructure globally.
Drive automation initiatives for provisioning, configuration management, monitoring, and remediation.
Champion best practices in observability, performance tuning, and fault isolation for large-scale AI clusters.
Lead incident response for major networking events, conducting root cause analysis and driving systemic improvements.
Establish clear SLAs, SLOs, and KPIs to measure and continuously improve network performance and resilience.

Cross-Functional & Program Collaboration

Collaborate closely with Compute, Platform, SRE, Data Centre Operations, and Procurement teams to ensure aligned execution.
Provide technical leadership in cross-functional planning forums, ensuring networking requirements are clearly understood and prioritized.
Support structured program planning and dependency management across regions and infrastructure initiatives.
Communicate architectural direction, risks, and trade-offs clearly to executive leadership and stakeholders.
Influence roadmap decisions through deep technical insight and data-driven analysis.

Team Leadership & Development

Build, mentor, and scale a high-performing cluster networking engineering team.
Foster a culture of engineering rigor, ownership, accountability, and continuous improvement.
Establish clear technical career pathways for engineers, including senior and principal-level growth tracks.
Lead recruitment efforts to attract top networking talent in highly competitive markets.
Act as a technical role model—setting high standards for design quality, documentation, and operational discipline.

Continuous Innovation & Improvement

Drive ongoing improvements in network efficiency, performance, and cost optimization.
Lead experimentation and benchmarking of new hardware, optics, firmware, and automation tooling.
Identify opportunities to reduce deployment time, increase reliability, and enhance customer performance outcomes.
Stay current with advancements in AI networking, HPC fabrics, and large-scale distributed systems.
Contribute to long-term infrastructure strategy as Nscale expands globally.

About You

Experience & Background

12+ years of experience in networking or infrastructure engineering, with at least 5 years in a senior technical leadership role (Head of Engineering, Director, or equivalent).
Deep hands-on experience designing and operating large-scale data centre or HPC networking environments.
Proven expertise in high-speed Ethernet and/or InfiniBand fabrics supporting GPU or AI workloads.
Strong background in data centre networking, routing protocols, congestion management, and high-availability design.
Experience leading globally distributed engineering teams in high-growth or hyperscale environments.
Exposure to structured program delivery and cross-functional coordination in complex infrastructure initiatives.

Core Competencies

Technical depth: expert-level understanding of networking technologies (Ethernet, InfiniBand, RDMA, BGP, EVPN, VXLAN, DC fabrics) and their application in AI/HPC environments.
Architectural leadership: ability to design scalable, resilient, high-performance cluster networking systems.
Engineering execution: strong command of engineering lifecycle processes, validation, and production operations.
Automation mindset: experience with infrastructure-as-code, network automation frameworks, and DevOps practices.
Reliability focus: demonstrated success building highly available, fault-tolerant infrastructure.
Data-driven decision making: skilled at performance analysis, capacity modeling, and using metrics to guide architectural decisions.
Communication mastery: able to explain complex technical concepts clearly to both technical and non-technical audiences.
Stakeholder influence: comfortable driving alignment across engineering, operations, and executive leadership.

Skills & Attributes

Hands-on technical leader who remains close to architecture and critical technical decisions.
Structured thinker with strong systems-level perspective and attention to detail.
Comfortable operating in fast-paced, high-growth environments with evolving priorities.
Proactive problem-solver with a pragmatic, solutions-oriented mindset.
Passionate about building world-class AI infrastructure and high-performance networking systems.
Strong written and verbal communication skills; comfortable presenting technical strategy to senior leadership.
Collaborative and inclusive leadership style; ability to build trust and high-performing engineering teams.

Nice to Have

Experience designing networking for large-scale GPU clusters or AI training environments.
Familiarity with HPC networking topologies (Fat Tree, Rail, Dragonfly).
Experience with SONiC, Cumulus, or other open networking platforms.
Knowledge of optics, transceivers, and high-speed interconnect standards (400G/800G).
Experience working with hardware vendors and participating in technical evaluations or RFPs.
Background in SRE, distributed systems, or large-scale cloud infrastructure.
Experience with Palantir Foundry, data platforms, or operational analytics.

What We Can Offer You

At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.

Highly competitive package (base + equity) with reviews every 12 months. 🚀
Join the fastest-growing tech startup. This is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
Human-First Flexibility: We treat our people as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

Equal Opportunities Statement

We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.

If there’s anything we can do to accommodate your specific situation, please let us know.

The responsibilities outlined in this job description are not exhaustive and are intended to provide a general overview of the position. The employee may be required to perform additional duties, tasks, and responsibilities as assigned by management, consistent with the skills and qualifications required for the role.