Technical Lead - SRE (Site Reliability Engineering) Job at First Advantage

Technical Lead - SRE (Site Reliability Engineering)

Job Summary

Responsibilities:

Run the production environment by monitoring availability and taking a holistic view of system health.
Build software and systems to manage platform infrastructure and applications.
Drive implementation of automation and monitoring to promote early detection, self-healing, improved availability, and fewer outages.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
Reduce operational inefficiencies in the incident management process to ensure the fastest path to recovery through automation and continuous process improvement.
Identify when escalation is required and trigger it accordingly.
Implement best-in-class incident response and communications through modern solutions such as Teams and SharePoint. This strategic part of the role ensures that internal stakeholders and customers receive accurate communications about any ongoing outage, what we are doing to restore service, and how we will prevent it from occurring again. It includes driving incident bridges to resolution with the highest sense of urgency.
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
Create and maintain recovery playbooks for commonly occurring customer patterns and issues.
Drive down resolution times by improving alert coverage and accuracy.
Create sustainable systems and services through automation and uplifts.
Implement automated recovery scripts and other monitoring enhancements.
Participate in system design consulting, platform management, and capacity planning.
Provide primary operational support and engineering for multiple large, distributed software applications.
Lead timely after-action reviews and root cause analyses that identify repair items to prevent future customer impact.
Ensure resolution of product/service defects, process improvements, and documentation enhancements to address live-site or customer-reported incidents.

What You May Need to be Successful:

4-year college degree minimum in a related technology field (Computer, Engineering, Science, etc.) or comparable job experience.
SRE (Site Reliability Engineering) related certification.
7+ years of experience in information technology, preferably managing large-scale environments.
Recent work experience in an SRE role implementing best-in-class reliability solutions in a large product development organization.
3+ years of work experience with the Azure public cloud platform.
Experience with Azure Monitor and AppInsights is preferred.
Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, and JavaScript.
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
Outstanding communication and presentation skills, written and verbal.
Excellent listening skills and a high degree of empathy.
Quick problem-solving skills with attention to detail.
Ability to work outside of normal business hours (weekend shifts, holidays, and evenings).
Excellent managerial skills and ability to collaborate with team members.
Strong analytical and time management skills.
Ability to incorporate various software engineering aspects to develop and implement services that improve IT and support teams. Services can range from production code changes to alerting and monitoring adjustments.

Experience Required:

Fresher

Vacancy:

2 - 4 Hires
