NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.
This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.
What You’ll Be Doing:
Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
Innovative AIOps Engineering: Lead the development of tools and proofs of concept focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.
What We Need to See:
BS/MS degree in Computer Science or a related technical field (or equivalent experience).
8+ years of overall software engineering experience, including at least 2 years in a management or technical lead role.
Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.
Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.
Ways to Stand Out from the Crowd:
AI/ML Background: Experience building or deploying AI Agents (for example, with LangChain or AutoGPT) or using ML models for anomaly detection and predictive analytics.
Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.