We are looking for a Senior QA and Automation Engineer to join the NMX team.
NVIDIA NMX is an integrated platform for management, monitoring, and analytics of cloud telemetry in large-scale GPU and NVLink-based data centers. It includes NMX Telemetry (NMX‑T) for collecting and aggregating telemetry, NMX Controller (NMX‑C) for configuring and controlling network and fabric components, and NMX Manager (NMX‑M) for analytics, health monitoring, and policy-driven automation. As part of the NMX QA group, you will own the quality of a distributed, cloud-scale management and telemetry platform that sits at the heart of NVIDIA’s next-generation AI data centers.
What you'll be doing:
Design, develop, and execute end-to-end tests for new NMX features as part of GA and maintenance releases.
Plan and implement test automation for APIs, services, and data pipelines (REST/gRPC, telemetry collection, control-plane flows), including test infrastructure and reusable libraries.
Build and maintain regression suites for functional, performance, scale, and resiliency scenarios across NVOS-based switches, GPU systems, and cloud environments.
Integrate with and validate against third-party and platform components such as Linux, NVOS/NVLink switches, networking stacks, containers, and orchestration environments.
Investigate complex issues across multiple services: reproduce bugs, analyze logs and telemetry, collaborate closely with development and architecture teams to isolate root causes, and verify fixes.
Contribute to observability of the product (metrics, logs, health checks, dashboards) to improve testability, debuggability, and production-readiness.
What we need to see:
Practical Engineer diploma, B.A., or B.Sc. in Computer Science or Electrical Engineering, or equivalent experience.
5+ years of hands-on QA / test automation experience in backend, distributed, or networking systems.
Strong programming and scripting skills, with 5+ years of experience in at least one of Python (preferred), Bash, or a similar language for automation and tooling.
Solid networking and systems background (3+ years): TCP/IP, L2/L3, and data center networking and/or fabric technologies.
Strong Linux fundamentals (shell, processes, networking, system debugging).
Proven ability to work independently and end-to-end: from test design through automation and execution to reporting.
Excellent communication and interpersonal skills, and comfort working with multi-site R&D and architecture teams.
Ways to stand out from the crowd:
Experience with telemetry / monitoring / observability platforms (e.g., Prometheus, Grafana, OpenTelemetry, Kafka, time-series databases).
Experience in HPC or large-scale AI/data center environments, or with fabric management solutions.
Proven experience designing automation infrastructure (frameworks, reusable libraries, CI integration) for distributed systems.
Hands-on experience with containers and orchestration (Docker, Kubernetes, Nomad, Consul) and CI/CD pipelines.
Familiarity with NVLink / InfiniBand / high-speed networking concepts.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.