Staff Engineer, Distributed Storage, HPC & AI Infrastructure
Together AI · Amsterdam, Netherlands · Engineering
About this role
Together AI is hiring a staff-level AI Infrastructure Engineer in the machine learning function based in Amsterdam, Netherlands (hybrid). The posting calls out experience with Python, Kubernetes, Terraform, Ansible and roughly 8+ years of relevant work. Listed education preference: a bachelor's degree or equivalent.
- Role: AI Infrastructure Engineer
- Function: Machine learning
- Level: Staff
- Track: Tech leadership
- Employment: Full-time
- Location: Amsterdam, Netherlands
- Work mode: Hybrid
- Experience: 8+ years
- Education: Bachelor's degree
- Department: Engineering
Job description
from Together AI careers
About the Role
In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.
You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.
Hybrid working: 2 days a week at our offices in Amsterdam.
Responsibilities
- Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
- Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
- Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
- Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.