Distributed Software Engineer

Cerebras Systems • Bengaluru, Karnataka, India; Sunnyvale CA or Toronto Canada • 1w ago

About The Role

Cerebras Systems is a pioneer in large-scale AI supercomputers. These multi-exaflop supercomputers are deployed in some of the biggest datacenters and are built using our Wafer-Scale Cluster technology, a cluster of several Wafer Scale Engine (WSE) chips. The Cluster engineering team is responsible for delivering all cluster-related software.

Responsibilities

Automate bare-metal configuration of networking, OS, and application software in large clusters of Cerebras WSE, servers, and switches.
Additional push-button workflows for cluster upgrades, downgrades, and security patching, with key metrics to minimize cluster downtime.
An orchestration and scheduling system for resource allocation, job submission, and job placement in a multi-user environment on a cluster.
Seamless support for both on-premises and cloud deployment and operations.
A robust system for monitoring, detecting and handling failures for a variety of resources on the clusters (including High Availability of clusters).
Broad cluster and job monitoring and visualization capabilities, along with alerting systems.
User-facing tools to monitor the status of jobs and collect metrics.
Administrator-facing tools to manage and operate large clusters.

Skills & Qualifications

Strong track record of software architecture, system design and development.
Strong track record of development for distributed clusters.
Strong understanding of the Kubernetes (K8s) software ecosystem, Prometheus, and Grafana.
Strong development skills in Go, Python, and Bash.
Strong debugging skills with distributed systems.
Strong skills in developing tests for new features and regression tests for existing features.