Sr Site Reliability Engineer Job in Plivo

Sr Site Reliability Engineer

Job Summary

Roles and Responsibilities

Develop and implement monitoring, alerting, and incident response processes to ensure the availability and performance of systems.
Identify performance bottlenecks, conduct root cause analysis, and implement solutions to optimize system performance.
Automate repetitive tasks, configuration management, and infrastructure provisioning to improve operational efficiency.
Quickly diagnose production issues, document designs and procedures, scale the infrastructure to meet demand, and proactively ensure the highest levels of systems and infrastructure availability.
Collaborate with security teams to implement and maintain robust security practices and measures.
Lead incident response efforts and drive them to closure.
Drive blameless post-incident reviews and RCA meetings.
Drive continuous improvement in incident management processes.
Lead disaster recovery (DR) exercises and dry runs.
Ensure every service has DR playbooks and that SRE tests those playbooks.
Collaborate with engineering and product teams to gain product knowledge.
Drive remediation tasks to closure.
Mentor and provide technical guidance to junior SRE team members, fostering a culture of knowledge sharing and continuous learning.
Stay updated with industry trends, best practices, and emerging technologies related to site reliability engineering and infrastructure automation.

Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Proven experience as a Site Reliability Engineer or in a similar role, with a strong focus on managing and maintaining large-scale distributed systems.
Expertise in designing, building, and troubleshooting highly available and scalable systems on cloud platforms (e.g., AWS, Azure, Google Cloud).
Proficiency in at least one programming language (e.g., Python, Java, Go) and experience with automation and configuration management tools (e.g., Ansible, Terraform).
Strong understanding of networking concepts, including TCP/IP, DNS, load balancing, and firewalls.
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes) and microservices architectures.
Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack) and experience implementing observability practices.
Familiarity with incident response and post-incident review processes, and an understanding of error budgets and service-level objectives (SLOs).
Excellent problem-solving and troubleshooting skills, and the ability to analyze complex systems to identify and resolve issues.
Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams.

Perks and Benefits

Empowerment to plan and execute.
Medical and life insurance.
Open culture and the chance to work with a young and dynamic team.
Career advancement opportunities.
Generous leave policy.

Experience Required :

Fresher

Vacancy :

2 - 4 Hires
