About the role

Principal Software Engineer at Microsoft

Required Skills

Python, C++, Azure, HPC, AI systems, telemetry, data pipelines, cloud infrastructure, GPU systems

About the Role

The Principal Software Engineer designs and develops high-volume, low-latency telemetry pipelines for Azure's flagship supercomputers, which are used by top AI customers. The role involves managing large-scale HPC and GPU systems, connecting to existing telemetry pipelines, and delivering insights on customer-facing issues across the infrastructure stack.

Key Responsibilities

  • Architect, design, and develop high-volume, low-latency, end-to-end event pipelines
  • Analyze existing event pipelines to evaluate fidelity, granularity, and latency
  • Contribute to improving key metrics such as Job Mean Time to Interrupt and Mean Time to Resolve
  • Partner with cross-organizational teams to evaluate available telemetry and drive architecture solutions
  • Drive engineering and operational excellence based on issues and learnings from strategic customers

Required Skills & Qualifications

Must Have:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years of technical engineering experience
  • 5+ years of hands-on experience designing and developing high-volume, low-latency pipelines
  • 3+ years of experience with AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft security screening requirements including Microsoft Cloud Background Check

Nice to Have:

  • Bachelor's Degree in Computer Science AND 10+ years of technical engineering experience OR Master's Degree AND 8+ years of experience
  • 5+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
  • 3+ years of experience in multiple datacenter technologies: power, cooling, IT hardware, telemetry

Benefits & Perks

  • Industry-leading healthcare