Site Reliability Engineer - CTJ - Poly at Microsoft
Required Skills
distributed systems, automation, monitoring, cloud technologies, incident response, capacity planning, scripting, telemetry analysis
About the Role
This Site Reliability Engineer role focuses on improving reliability, performance, and scalability of large-scale distributed systems. Responsibilities include automating operations, analyzing telemetry, and participating in on-call incident response. The position requires a U.S. government Top Secret clearance with SCI and polygraph.
Key Responsibilities
- Independently creates, tests, and deploys changes through safe deployment processes to enhance code quality and system observability
- Writes code or scripts to automate scalable operations processes like monitoring, alerting, and deployments
- Develops alerts and instrumentation to monitor product capacity, security risks, and resource demands
- Engages with product engineering teams through code reviews, meetings, on-call rotations, and incident response
- Uses tools and models to troubleshoot availability, security, reliability, and performance issues
Required Skills & Qualifications
Must Have:
- Master's Degree in Computer Science, IT, or related field AND 1+ years technical experience OR Bachelor's Degree AND 2+ years experience OR equivalent experience
- Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on Single Scope Background Investigation (SSBI) with Polygraph
- Verification of U.S. citizenship
- Ability to pass Microsoft Cloud background check upon hire and every two years
Nice to Have:
- Experience working on large-scale distributed services with on-call responsibilities
- Ability to build and influence broadly towards common goals and priorities
- Experience with distributed database systems such as SQL and PostgreSQL
Benefits & Perks
- Industry-leading healthcare