Staff · Machine learning · AI Infrastructure Engineer · Tech leadership · 8+ yrs · Bachelor's · Hybrid

Skills

Python, Kubernetes, Terraform, Ansible, Helm, Prometheus, Grafana, Linux, Networking, Data Structures, Encryption, DevOps, Observability, Machine Learning, Infrastructure as Code, Performance Optimization, Disaster Recovery, Cloud Computing, ArgoCD

About this role

Together AI is hiring a staff-level AI Infrastructure Engineer in the machine learning function, based in Amsterdam, Netherlands (hybrid). The posting calls out experience with Python, Kubernetes, Terraform, and Ansible, and asks for 8+ years of relevant work. Listed education preference: a bachelor's degree or equivalent.

Role: AI Infrastructure Engineer
Function: machine learning
Level: staff
Track: Tech leadership
Employment: Full-time
Location: Amsterdam, Netherlands
Work mode: Hybrid
Experience: 8+ years
Education: Bachelor's degree
Department: Engineering

AI Summary

Design and operate multi-petabyte AI storage systems integrating WekaFS, Ceph, and Lustre. Build Kubernetes storage operators for automated provisioning and multi-tenancy. Optimize data paths for 10-50 GB/s per node and achieve 30-50% cost savings. Requires 8+ years of storage engineering with 3+ years at multi-petabyte scale and proven GPU/HPC cluster experience.

Job description

from Together AI careers

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world’s largest AI training and inference workloads. You’ll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.

You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you’ll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Hybrid working: 2 days a week at our offices in Amsterdam.

Responsibilities

Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.

This is an excerpt. Read the full job description on Together AI careers →

Staff Engineer, Distributed Storage, HPC & AI Infrastructure
