Deep Learning Kernel Software Performance Architect

CN05 NVIDIA Shanghai WFOE • Full-time • China, Shanghai • 22h ago

NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.

What you'll be doing:

Performance analysis + debugging
- Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
- Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
- Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.
Automation + regression infrastructure (Python-heavy)
- Develop and maintain Python-based automation for performance testing and analysis—using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
- Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
- Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.
Cross-team collaboration and operating model
- Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.
- Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.
- Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently
Following general software engineering best practices including support for regression testing and CI/CD flows

What we need to see:

Masters or PhD degree or equivalent experience in Computer Science, Computer Engineering, Applied Math, or related field
Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)
Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).
Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.
Comfortable working across teams and driving issues to decision/closure with clear communication
Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design
Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g. with OpenMP or pthreads)
Solid understanding of computer architecture and some experience with assembly programming
Identify bottlenecks, optimize resource utilization, and improve throughput

Ways to stand out from the crowd:

Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)
Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics
GPU programming/perf experience (CUDA or equivalent parallel programming)
Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)
Familiarity with simulators/analytical modeling or performance characterization methodology