Senior AI and HPC Observability Engineer
Nvidia · Santa Clara, CA
About this role
Nvidia is hiring a senior-level Infrastructure Engineer in the software engineering function, based in Santa Clara, CA. The posting calls out experience with DevOps, observability, Python, and Java, along with 5+ years of relevant work. Listed education preference: a bachelor's degree or equivalent.
- Role: Infrastructure Engineer
- Function: software engineering
- Level: senior
- Track: Individual contributor
- Employment: Full-time
- Location: Santa Clara, CA
- Experience: 5+ years
- Education: Bachelor's degree
- Posted: Apr 20, 2026
Job description
from Nvidia careers
NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. Our technology powers everything from generative AI to autonomous systems, and we continue to shape the future of computing through innovation and collaboration. Within this mission, our team, Managed AI Superclusters (MARS), builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help design solutions that power some of the world’s most advanced computing workloads.
Observability is at the heart of this transformation. We are looking for a strong AI & HPC Observability Engineer to build and scale next-generation Observability and Telemetry platforms. You will design and develop high-throughput, reliable telemetry pipelines and modern data infrastructure. This role requires solid distributed systems fundamentals, production-grade coding, and a passion for operational excellence.
What You Will Be Doing:
- Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
- Build high-performance backend services for telemetry ingestion, processing, and routing
- Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
- Build and optimize metrics pipelines using large-scale time-series storage systems
- Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies
This is an excerpt. Read the full job description on Nvidia careers.