Cloud Network Engineer at Microsoft
Required Skills
Networking, InfiniBand, Ethernet, Linux, BGP, MPLS, VXLAN/EVPN, telemetry, AI/HPC
About the Role
This Cloud Network Engineer role focuses on designing and operating high-performance networking systems for AI/HPC clusters. Responsibilities include deploying low-latency network topologies, monitoring network health, and collaborating across hardware and software teams. The role requires experience with data center networks and distributed computing platforms.
Key Responsibilities
- Support deployment of high-throughput, low-latency network topologies (e.g., Clos, fat-tree) using InfiniBand and Ethernet
- Monitor network health, respond to incidents, perform root-cause analysis, and improve availability and observability
- Collaborate with hardware engineering, data center operations, and software-defined networking teams
- Maintain documentation for network designs, cabling standards, and deployment procedures
- Stay informed about advancements in optical networking, high-speed interconnects, and AI/HPC fabric technologies
Required Skills & Qualifications
Must Have:
- Bachelor's Degree in Electrical Engineering, Optical Engineering, Computer Science, Engineering, Information Technology, or related field OR equivalent experience
- Experience designing, deploying, and supporting data center and backbone networks for distributed computing platforms
- Ability to pass Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to Have:
- Master's Degree in Electrical Engineering, Optical Engineering, Computer Science, Information Technology, or related field OR Bachelor's Degree with 2+ years technical experience in network design, development, and automation
- Solid understanding of routing protocols, including BGP and MPLS, and of tunneling techniques, including VXLAN/EVPN
- Experience with telemetry and observability tools for monitoring physical network health, link performance, and congestion at scale
- Background in building scalable, fault-tolerant physical networks for distributed computing environments (e.g., AI/ML clusters, HPC systems)
- Proficiency in Linux-based systems, including kernel-level networking, interface tuning, and low-level debugging of physical network issues
Benefits & Perks
- Industry-leading healthcare