Senior Software Engineer - Together Cloud Infrastructure
Together AI · San Francisco, CA · Engineering
About this role
Together AI is hiring a senior Infrastructure Engineer for its software engineering function, based in San Francisco, CA. The posting calls for experience with Go, CUDA, AWS, and GCP, and 5+ years of relevant work. Compensation is listed at $160,000–$230,000 per year.
- Role: Infrastructure Engineer
- Function: software engineering
- Level: senior
- Track: Individual contributor
- Employment: Full-time
- Location: San Francisco, CA
- Experience: 5+ years
- Department: Engineering
Job description
from Together AI careers
About the Role
Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure.
As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and gives state-of-the-art ML practitioners self-serve AI cloud services, such as on-demand and managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.
Responsibilities
- Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
- Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
- Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
- Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.
- Perform architecture and research work for decentralized AI workloads.
- Work on the core, open-source Together AI platform.