About the role

Software Engineer - AI/ML, AWS Neuron Distributed Training at Annapurna Labs (U.S.) Inc.

Required Skills

Python, PyTorch, TensorFlow, distributed training, machine learning, AWS, software development, large language models

About the Role

This role is for a senior software engineer on the Machine Learning Applications team for AWS Neuron, focusing on distributed training of large-scale ML models such as GPT and Stable Diffusion. The engineer will build and tune distributed training solutions using PyTorch and TensorFlow on AWS Trainium and Inferentia silicon. Strong software development and ML expertise are essential.

Key Responsibilities

  • Build, deliver, and maintain complex products for AWS Neuron distributed training
  • Design fault-tolerant systems that run at massive scale in the AWS Cloud
  • Develop, enable, and performance-tune a wide variety of ML model families, including large language models
  • Lead efforts to build distributed training support into PyTorch and TensorFlow using the XLA and Neuron stacks
  • Tune models to ensure the highest performance on AWS Trainium and Inferentia silicon

Required Skills & Qualifications

Must Have:

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience in at least one programming language
  • 5+ years of leading design or architecture of new and existing systems
  • 5+ years of full software development life cycle experience

Nice to Have:

  • Bachelor's degree in computer science or equivalent

Benefits & Perks

  • Inclusive team culture with employee-led affinity groups
  • Work-life balance with flexible working hours
  • Mentorship and career growth opportunities
  • Comprehensive compensation package including medical, financial, and other benefits