Site Reliability Engineer Job at Blueshift Labs

Site Reliability Engineer

Job Summary

Responsibilities

  • On-call duties to provide application support, incident management, and troubleshooting
  • Participate in shift rotations to keep the SRE function available 24x7
  • Improve reliability and drive down the burden of toil with tooling and automation
  • Analyze complex systems from a reliability, resilience, and performance perspective
  • Identify sources of instability in large-scale distributed systems and drive operational excellence
  • Hands-on implementation and management of complex virtualized environments
  • Implement scale-up / scale-down strategies based on various utilization metrics
  • Author incident reports by coordinating with multiple engineering teams
  • Identify and fill gaps in the monitoring & alerting system
  • Periodic reporting of system status to the organization

Requirements

  • 5+ years of relevant industry experience
  • Prior hands-on experience with managing AWS and cloud infrastructure scaling to hundreds of nodes
  • Experience with managing a container orchestration system
  • Deep understanding of large-scale data systems and data pipelines, including managing NoSQL, SQL, and HDFS/Hadoop clusters
  • Experience with modern SRE practices & tools
  • Hands-on experience with active incident management
  • Willingness & ability to work night shifts

Experience Required:

Minimum 5 Years

Vacancy:

2 - 4 Hires
