AI Infrastructure Engineer · staff · machine learning · tech leadership · 8+ yrs · Bachelor's · Hybrid

About this role

Together AI is hiring a staff-level AI Infrastructure Engineer for its machine learning function, based in Amsterdam, Netherlands (hybrid). The posting calls for experience with Python, Kubernetes, Terraform, and Ansible, and 8+ years of relevant work. Listed education preference: a bachelor's degree or equivalent.

Role: AI Infrastructure Engineer
Function: machine learning
Level: staff
Track: Tech leadership
Employment: Full-time
Location: Amsterdam, Netherlands
Work mode: Hybrid
Experience: 8+ years
Education: Bachelor's degree
Department: Engineering
AI Summary
Design and operate multi-petabyte AI storage systems integrating WekaFS, Ceph, and Lustre. Build Kubernetes storage operators for automated provisioning and multi-tenancy. Optimize data paths for 10-50 GB/s per node and achieve 30-50% cost savings. Requires 8+ years of storage engineering with 3+ years at multi-petabyte scale, plus proven GPU/HPC cluster experience.

Job description

from Together AI careers

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.

You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Hybrid working: 2 days a week at our offices in Amsterdam.

Responsibilities

  • Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
  • Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
  • Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
  • Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.

This is an excerpt. The full job description is available on Together AI careers.