About The Role
As a Compute / Server Platform Architect on the Cluster Architecture Team, you will own the server-side platform architecture that enables Cerebras CS-3-based AI clusters (training and inference) to deliver predictable performance, scalability, and reliability. Our accelerators are network-attached, so the x86 server fleet is a first-class part of the end-to-end system: it runs critical-path runtime functions (for example, orchestration, prompt caching, and IO/control services) and must be co-designed with software for token-level latency, throughput, and cost efficiency. You will translate workload behavior into CPU, memory, IO, PCIe, and host-networking requirements, drive platform evaluations with vendors, and provide technical leadership through qualification and production adoption in close partnership with other function leaders and TPMs.
Responsibilities
- Own the architecture for all server roles in Cerebras clusters, including definitions of server types, configurations, and lifecycle strategy.
- Define and maintain server formulas (counts and ratios per CS-3 count, cluster size, and workload type), including capacity planning and headroom policy; see the sketch after this list for an illustration.
- Specify platform configurations: CPU SKU and core strategy, vendor roadmap alignment (e.g., AMD, Intel, Arm), memory topology (channels, DIMM type, capacity), PCIe topology and lane budgeting, NIC selection/placement, and local NVMe policy where applicable.
- Translate software and runtime flows into measurable hardware requirements (CPU utilization, memory bandwidth/latency, bursty IO patterns, queueing and concurrency limits) and communicate clear guardrails back to software teams.
- Develop performance and scaling models; validate with microbenchmarks and workload-level experiments; identify bottlenecks and drive cross-stack fixes.
- Define the OS, BIOS, firmware, and driver baseline for each server type; partner teams adopt these baselines and apply them across our fleet.
- Stay current on emerging server technologies (new CPU generations, memory technologies, CXL, NVMe evolution, SmartNIC/DPU capabilities where relevant) and run proof-of-concept evaluations to determine when to adopt.
- Lead technical vendor engagements (OEM/ODM and component vendors): influence roadmap, request platform knobs, and drive joint debugging on performance or reliability issues.
- Define qualification and acceptance criteria (performance, stability, operability) and partner with the Infrastructure Hardware TPM to execute qualification plans and land changes cleanly into production.
- Support bring-up and occasional deployment debugging in lab and staging environments; drive root-cause analysis for regressions spanning firmware, drivers, OS, and runtime behavior.
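
To make "server formulas" concrete, below is a minimal sketch of the kind of capacity model this role would own, written in Python. Every name, ratio, and headroom value here is a hypothetical placeholder for illustration, not an actual Cerebras configuration.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class ServerFormula:
    """Hypothetical ratios mapping CS-3 count to supporting server counts.

    All values are illustrative placeholders, not real Cerebras ratios.
    """
    orchestration_per_cs3: float = 0.5   # orchestration servers per CS-3
    io_per_cs3: float = 2.0              # IO/control servers per CS-3
    headroom: float = 0.2                # 20% capacity headroom policy

    def servers_needed(self, cs3_count: int) -> dict[str, int]:
        """Apply the per-role ratios, then round up after adding headroom."""
        def with_headroom(raw: float) -> int:
            return ceil(raw * (1 + self.headroom))
        return {
            "orchestration": with_headroom(cs3_count * self.orchestration_per_cs3),
            "io_control": with_headroom(cs3_count * self.io_per_cs3),
        }

# Example: a 16x CS-3 cluster under this hypothetical formula.
print(ServerFormula().servers_needed(16))
# {'orchestration': 10, 'io_control': 39}
```

In practice the real formulas would carry more roles and vary by workload type; the point of owning them in code is that capacity and headroom decisions stay explicit and reviewable.
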
Skills and Qualifications
- PhD in Computer Science or Electrical/Computer Engineering plus 8+ years of industry experience, or a Master’s/Bachelor’s in CS or EE plus 10+ years of industry experience.
- 5+ years of experience in server platform architecture, systems performance engineering, or large-scale infrastructure design for AI/ML, HPC, or performance-sensitive distributed systems.
- Deep understanding of x86 server architecture: CPU microarchitecture basics, cache hierarchies, NUMA, memory controllers/channels, and memory bandwidth vs latency tradeoffs.
- Strong Linux systems knowledge: profiling and performance analysis, scheduling and syscall overheads, memory management behavior, and practical tuning methodology.
- Experience reasoning about high-performance IO paths, including NIC behavior at a systems level, RDMA/RoCE concepts, and NVMe performance characteristics.
- Proven ability to create capacity and performance models and validate them empirically with a rigorous benchmarking plan; the sketch after this list illustrates the idea.
- Experience working directly with vendors/partners to evaluate platforms, drive issue resolution, and influence roadmaps.
- Strong cross-functional communication skills and ability to drive technical decisions through clear tradeoff documents and reviews.
- Familiarity with application- and systems-level software (C, C++, Python).
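
As an illustration of the modeling-and-validation expectation above, here is a minimal sketch that compares a simple analytic latency model against measured samples. The M/M/1 model form, the 15% tolerance, and the sample values are all assumptions chosen for illustration, not a prescribed methodology.

```python
import statistics

def mm1_latency(service_time_s: float, utilization: float) -> float:
    """Predicted mean latency under an M/M/1 queueing model.

    Assumes Poisson arrivals and exponential service times; real host
    IO paths rarely match this exactly, which is why we validate.
    """
    assert 0 <= utilization < 1, "model is only defined below saturation"
    return service_time_s / (1.0 - utilization)

def validate(predicted_s: float, measured_s: list[float],
             tolerance: float = 0.15) -> bool:
    """Accept the model if the measured mean is within +/-15% of prediction."""
    mean = statistics.mean(measured_s)
    return abs(mean - predicted_s) / predicted_s <= tolerance

# Example: 100us base service time at 70% utilization, checked against
# fabricated stand-in samples (real runs would come from a microbenchmark).
pred = mm1_latency(100e-6, 0.70)            # ~333us predicted mean latency
samples = [310e-6, 350e-6, 325e-6, 360e-6]
print(pred, validate(pred, samples))        # -> 0.000333... True
```

The same loop scales up: propose a model, run the microbenchmark or workload-level experiment, and either accept the model or dig into the discrepancy as a potential cross-stack bottleneck.
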