Engineering Manager, Accelerator Platform

Anthropic • San Francisco, CA | New York City, NY | Seattle, WA • 3w ago

About the Role

Every time someone talks to Claude -- through the API, claude.ai, our cloud partners, or any of our expanding surfaces -- the request lands on an AI accelerator. Not one kind, many kinds: TPUs, Trainium chips, GPUs. Each arrives with its own software stack, performance characteristics, failure modes, and operational quirks. Someone has to take raw silicon and turn it into a platform that the rest of Anthropic can build on without thinking about which chip is underneath. That's us.

The Accelerator Platform team owns the bring-up and normalization of new hardware platforms for Anthropic's first-party inference fleet. We sit between the low-level systems teams and the serving infrastructure that runs production inference -- bridging the gap so that every new accelerator generation ships as a first-class production platform. It's deeply technical work at the intersection of hardware enablement, distributed systems, and ML infrastructure, and it is directly on the critical path for Anthropic's compute strategy.

We're hiring an Engineering Manager to build and lead this team. You'll inherit a small nucleus of experienced engineers and grow it into a standalone platform organization. You'll set technical direction, hire a strong team, and partner closely with hardware vendors, cloud providers, and teams across Inference to bring new accelerator generations online quickly and reliably.

Responsibilities:

Build and lead the Accelerator Platform team -- hiring, developing, and retaining engineers who thrive at the hardware/software boundary
Own the end-to-end bring-up lifecycle for new accelerator platforms (multiple generations of Trainium, TPUs, and GPUs), from initial silicon availability through production-ready inference
Define and drive the platform normalization layer -- ensuring new hardware integrates cleanly with Anthropic's inference serving stack to provide a consistent abstraction
Partner with cloud providers (AWS, GCP, Microsoft Azure) and chip vendors on hardware roadmaps, capacity planning, and platform-specific technical challenges
Collaborate closely with teams across Inference and Infrastructure to ensure new platforms meet production reliability and latency requirements from day one
Contribute to Anthropic's multi-cloud compute strategy -- helping the organization maintain optionality across accelerator families and avoid lock-in to any single vendor
Manage the team's priorities across competing demands: new platform bring-up, ongoing production support for existing platforms, and longer-term investments in tooling and automation

You may be a good fit if you:

Have significant experience managing infrastructure or platform engineering teams (3+ years in engineering management)
Have deep technical fluency in systems programming, distributed systems, or hardware/software co-design -- you need to understand the stack deeply enough to make sound technical and hiring decisions
Have experience bringing up or operating heterogeneous compute infrastructure at scale -- whether that's GPU clusters, TPU pods, custom ASICs, or FPGA deployments
Are comfortable with ambiguity and can build structure where none exists. This team is being carved out as a new entity; you'll be defining its charter, processes, and culture from scratch
Think strategically about hardware roadmaps and can translate vendor capabilities into engineering plans
Build strong cross-functional relationships -- this role requires tight collaboration with hardware vendors, cloud partners, and half a dozen internal teams
Care deeply about both technical excellence and the people doing the work

Strong candidates may also:

Have direct experience with ML accelerator architectures (GPU/CUDA, TPU/XLA, Trainium/Neuron, or similar)
Have worked on ML inference serving infrastructure at scale (1000+ accelerators)
Have experience with Kubernetes-based ML workload orchestration
Understand ML-specific networking (RDMA, InfiniBand, NVLink, ICI) and how interconnect topology affects serving performance
Have experience managing vendor relationships and influencing hardware/software roadmaps
Have led teams through rapid growth phases (hiring 5+ engineers in a short timeframe)
