Training: ML Framework Engineer
OpenAI · San Francisco, CA · Scaling
About this role
OpenAI is hiring a mid-level ML Platform Engineer in the machine learning function based in San Francisco, CA (hybrid). The posting calls out experience with Python, Machine Learning, Distributed Systems, and Observability. Compensation is listed at $205,000–$445,000 per year.
- Role: ML Platform Engineer
- Function: machine learning
- Level: mid
- Track: Individual contributor
- Employment: Full-time
- Location: San Francisco, CA
- Work mode: Hybrid
- Department: Scaling
- Posted: Oct 29, 2025
Job description
from OpenAI careers
About the Team
Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, we’re building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve.
Our work focuses on three pillars: high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement; performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); and distributed process management for long-lived, job-specific and user-provided processes.
We integrate proven large-scale capabilities into a composable, developer-facing runtime so teams can iterate quickly and run reliably at any scale, partnering closely with model-stack, research, and platform teams. Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and products).
About the Role
As a Training: ML Framework Engineer, you will work on improving training throughput for our internal training framework while enabling researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. In all the projects this role pursues, the ultimate goal is to push the field forward.