MLOps Engineer

Turing
🇧🇷 Brazil – Remote
Full-time
$125K–$157K
Estimated
Remote

Required Skills

⏰ Full Time
🟡 Mid-level
🟠 Senior
🤖 Machine Learning Engineer
Docker
Flash
Google Cloud Platform
Kubernetes
Node.js
Python
PyTorch
Ray
Shell Scripting
Terraform
Machine Learning
LLM
RAG
R
Go
Scala
MLOps
GitHub Actions
MLflow
Weights & Biases
Git
Research

Job Description

📋 Description

• Manage and maintain Ray clusters deployed on GCP/GKE to support distributed LLM training and inference.
• Optimize multi-node, multi-GPU workloads for both fine-tuning and inference pipelines using Ray, Kubernetes, and GCP services.
• Assist the research team with environment debugging, dependency management, and containerization (e.g., CUDA/PyTorch/Flash-Attn stacks).
• Build and maintain reusable infrastructure templates (e.g., Terraform modules, Helm charts) for reproducible research environments.
• Monitor system performance and optimize cluster resource allocation and autoscaling.
• Support CI/CD workflows for experiment tracking and deployment pipelines.
• Collaborate with research engineers to improve the usability, reliability, and scalability of our training infrastructure.

🎯 Requirements

• 3+ years experience in DevOps/MLOps roles with a focus on machine learning infrastructure.
• Solid hands-on experience with Ray, Kubernetes (GKE preferred), and multi-GPU orchestration.
• Proficiency with GCP services (Compute Engine, GCS, IAM, VPC, etc.).
• Strong working knowledge of Python and shell scripting.
• Experience managing CUDA-based environments for training and inference with PyTorch.
• Familiarity with containerization (Docker) and environment isolation (Conda, virtualenv).
• Experience with IaC tools (Terraform, Helm).
• Strong troubleshooting skills in distributed environments (networking, storage, job failures, etc.).
• Experience with LLM training, LoRA fine-tuning, or RLHF pipelines.
• Familiarity with FlashAttention, DeepSpeed, FSDP, or other large-scale model optimization techniques.
• Knowledge of CI/CD tools (GitHub Actions, ArgoCD) and experiment tracking (e.g., MLflow, Weights & Biases).
• Exposure to event-driven compute or serverless functions on GCP.
• Ability to write clean internal tooling (e.g., dashboards, CLI utilities).

🏖️ Benefits

• Amazing work culture (Super collaborative & supportive work environment; 5 days a week)
• Awesome colleagues (Surround yourself with top talent from Meta, Google, LinkedIn etc. as well as people with deep startup experience)
• Competitive compensation
• Flexible working hours
• Full-time remote opportunity

Job Details

Employment Type: Full-time

Salary Range: $125K–$157K (Estimated)

Location: 🇧🇷 Brazil – Remote

Remote Work: Remote Friendly