Job Description
<h3>📋 Description</h3>
• Manage and maintain Ray clusters deployed on GCP/GKE to support distributed LLM training and inference.
• Optimize multi-node, multi-GPU workloads for both fine-tuning and inference pipelines using Ray, Kubernetes, and GCP services.
• Assist the research team with environment debugging, dependency management, and containerization (e.g., CUDA/PyTorch/Flash-Attn stacks).
• Build and maintain reusable infrastructure templates (e.g., Terraform modules, Helm charts) for reproducible research environments.
• Monitor system performance and optimize cluster resource allocation and autoscaling.
• Support CI/CD workflows for experiment tracking and deployment pipelines.
• Collaborate with research engineers to improve the usability, reliability, and scalability of our training infrastructure.
<h3>🎯 Requirements</h3>
• 3+ years of experience in DevOps/MLOps roles with a focus on machine learning infrastructure.
• Solid hands-on experience with Ray, Kubernetes (GKE preferred), and multi-GPU orchestration.
• Proficiency with GCP services (Compute Engine, GCS, IAM, VPC, etc.).
• Strong working knowledge of Python and shell scripting.
• Experience managing CUDA-based environments for training and inference with PyTorch.
• Familiarity with containerization (Docker) and environment isolation (Conda, virtualenv).
• Experience with IaC tools (Terraform, Helm).
• Strong troubleshooting skills in distributed environments (networking, storage, job failures, etc.).
• Experience with LLM training, LoRA fine-tuning, or RLHF pipelines.
• Familiarity with FlashAttention, DeepSpeed, FSDP, or other large-scale model optimization techniques.
• Knowledge of CI/CD tools (GitHub Actions, ArgoCD) and experiment tracking (e.g., MLflow, Weights & Biases).
• Exposure to event-driven compute or serverless functions on GCP.
• Ability to write clean internal tooling (e.g., dashboards, CLI utilities).
<h3>🏖️ Benefits</h3>
• Amazing work culture (Super collaborative & supportive work environment; 5 days a week)
• Awesome colleagues (Surround yourself with top talent from Meta, Google, LinkedIn, etc., as well as people with deep startup experience)
• Competitive compensation
• Flexible working hours
• Full-time remote opportunity