About the Role

This role involves building and optimizing distributed training infrastructure for large-scale machine learning models, particularly deep learning and transformer architectures. The engineer will collaborate with scientists and engineers to deliver scalable, high-performance systems for state-of-the-art AI research and applications in robotics.

Key Responsibilities

Design, build, and optimize machine learning infrastructure for large-scale training and inference
Apply PyTorch, Python, and C++ skills to engineer modular, scalable ML systems
Evaluate and implement parallelism techniques such as data, tensor, model, and pipeline parallelism
Monitor and optimize GPU memory and throughput for training large models efficiently
Collaborate cross-functionally with research and data infra teams to integrate new models and features

Required Skills & Qualifications

Must Have:

3+ years of non-internship professional software development experience
2+ years of non-internship design or architecture experience for new and existing systems
Experience programming in at least one programming language
Deep understanding of LLM algorithms and deep learning frameworks like PyTorch
Strong understanding of linear algebra, calculus, probability, and statistics

Nice to Have:

3+ years of full software development life cycle experience including coding standards, code reviews, source control, build processes, testing, and operations
Bachelor's degree in computer science or equivalent

Benefits & Perks

Full range of medical, financial, and/or other benefits
Equity, sign-on payments, and other forms of compensation may be provided
Inclusive culture with workplace accommodations for disabilities

SDE - ML Engineer, Frontier AI Robotics at Amazon.com Services LLC