SDE- ML Engineer, Frontier AI Robotics at Amazon.com Services LLC

Required Skills

Python, PyTorch, C++, machine learning, deep learning, LLMs, distributed systems, GPU optimization, linear algebra

About the Role

This role involves building and optimizing distributed training infrastructure for large-scale machine learning models, particularly deep learning and transformer architectures. The engineer will collaborate with scientists and engineers to deliver scalable, high-performance systems for state-of-the-art AI research and applications in robotics.

Key Responsibilities

  • Design, build, and optimize machine learning infrastructure for large-scale training and inference
  • Apply PyTorch, Python, and C++ skills to engineer modular, scalable ML systems
  • Evaluate and implement parallelism techniques such as data, tensor, model, and pipeline parallelism
  • Monitor and optimize GPU memory usage and throughput to train large models efficiently
  • Collaborate cross-functionally with research and data infrastructure teams to integrate new models and features

Required Skills & Qualifications

Must Have:

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience for new and existing systems
  • Experience programming in at least one software programming language
  • Deep understanding of LLM algorithms and deep learning frameworks like PyTorch
  • Strong understanding of linear algebra, calculus, probability, and statistics

Nice to Have:

  • 3+ years of full software development life cycle experience including coding standards, code reviews, source control, build processes, testing, and operations
  • Bachelor's degree in computer science or equivalent

Benefits & Perks

  • Full range of medical, financial, and/or other benefits
  • Equity, sign-on payments, and other forms of compensation may be provided
  • Inclusive culture with workplace accommodations for disabilities