About The Role
Cerebras Systems is a pioneer in large-scale AI supercomputers. These multi-exaflop supercomputers are deployed in some of the biggest datacenters and are built using our Wafer-Scale Cluster technology - a cluster of several Wafer Scale Engine (WSE) chips. The Cluster engineering team is responsible for delivering all cluster-related software.
Responsibilities
- Automate bare-metal configuration of networking, OS, and application software in large clusters of Cerebras WSE, servers, and switches.
- Additional push-button workflows for cluster upgrades, downgrades, and security patching, with key metrics to minimize cluster downtime.
- An orchestration and scheduling system for resource allocation, job submission, and job placement in a multi-user environment on a cluster.
- Seamless support for both on-premises and cloud deployment and operations.
- A robust system for monitoring, detecting and handling failures for a variety of resources on the clusters (including High Availability of clusters).
- Broad cluster and job monitoring and visualization capabilities, along with alerting systems.
- User-facing tools to monitor the status of jobs and collect metrics.
- Administrator-facing tools to manage and operate large clusters.
Skills & Qualifications
- Strong track record of software architecture, system design and development.
- Strong track record of development for distributed clusters.
- Strong understanding of the Kubernetes (K8s) software ecosystem, Prometheus, and Grafana.
- Strong development skills in Go, Python, and Bash.
- Strong debugging skills with distributed systems.
- Strong skills in developing tests for new features and regression tests for existing features.