Principal Product Manager (IC) · Posted May 14, 2026

About this role

Nvidia is hiring a principal-level Product Manager based in Santa Clara, CA. The posting calls out experience with DevOps, MLOps, distributed systems, and AI agents.

Role: Product Manager
Function: Product
Level: Principal
Track: Individual contributor
Employment: Full-time
Location: Santa Clara, CA
Posted: May 14, 2026

Job description

from Nvidia careers

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

What You’ll Be Doing:

  • Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.

  • Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.

  • Build the operator UX for repair queues, workflow transparency, and audit trails — ensuring on-call engineers have the context they need to act quickly and confidently.

  • Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.

  • Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.

This is an excerpt of the full job description posted on Nvidia careers.