Site Reliability Engineer - II Job at Mindtickle
Site Reliability Engineer - II
- Bengaluru, Bangalore Urban, Karnataka
- Not Disclosed
- Full-time
Job Brief:

You will be part of the Site Reliability team, a group of very competent engineers. The SRE team at Mindtickle is a sub-team within DevOps, which is responsible overall for maintaining our production infrastructure, tools, and pipelines. The SRE team is specifically responsible for discovering, implementing, and automating the measures needed to maintain the Mindtickle platform's uptime. This includes:

- Infrastructure, tools, and automation for measuring, monitoring, and alerting on uptime, SLAs, costs, etc.
- Load testing and scaling for various kinds of infrastructure, such as services, queues, functions, databases, and data pipelines.
- Incident response and postmortems.
- Checks and automated tests to ensure sanity.

Mindtickle has an extensive platform of up to 225 service components running fully on AWS with five-nines (99.999%) availability, which will provide ample challenge for a capable engineer like you. Your mandate will be to address the challenging problems of an exponentially growing org that currently has a 150-strong engineering team. You will constantly improve our systems, approaches, processes, and tools to continue developing Mindtickle into a world-class engineering team. You will need extensive, hands-on knowledge of these technologies, exceptional ability and a deep interest in learning new developments in this field, ample energy to evangelize and implement appropriate solutions across the org, and a keen interest in growing and mentoring your fellow team members.

What you'll be doing:

- Work in concert with engineering teams to evolve services for better scalability, reliability, and development velocity.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Focus on improving reliability.
- Define, measure, and tune key performance indicators and metrics to ensure a seamless experience.
- Develop tools that improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments.
- Practice sustainable incident response and blameless postmortems.
- Configure and maintain common applications such as Apache, Tomcat, Nginx, Memcached, Squid, OAuth, NFS, DHCP, DNS, and SNMP at an expert level.
- Apply thorough knowledge of deployment, management, and cost optimization techniques for machine instances on public clouds (AWS, GCP, or Azure).
- Design monitoring, logging, and reliability processes for systems at scale.
- Design and develop solutions for cloud security, secrets management, and key rotation.
- Provide on-call support on a rotation basis for services running on the Mindtickle platform, manage incidents, and work with application developers on root cause analysis.
- Quickly learn new processes, applications, and tools as required.
- Maintain, review, propose, and implement improvements to existing infrastructure, tools, and processes.
- Build, maintain, and own InfoSec compliance efforts by implementing and enforcing appropriate processes and standards across the organization.
- Contribute to open-source software projects.

Requirements:

- Bachelor's degree in Computer Science or equivalent, with a minimum of 4 years of relevant experience.
- Experience with cloud IaaS (primarily AWS, Azure, and GCP).
- Expertise in Grafana, Prometheus, OpenTelemetry, and Thanos.
- Experience with observability tools such as Datadog and Sumo Logic.
- Experience with on-call rotations and related tools such as PagerDuty.
- Experience with containers and orchestration (Docker, Kubernetes).
- In-depth Linux/Unix knowledge and a good understanding of the various kernel subsystems (CPU, memory, storage, network, etc.).
- Solid understanding of load balancing, TCP/IP networking, and CDNs.
- Experience with modern software components (Nginx, MongoDB, Postgres, Redis, Elasticsearch, RabbitMQ, JVM, Play).
- Configuration management using tools such as Ansible.
- Working experience with source code management systems such as Git.
- Experience coding in higher-level languages (e.g., Python, Golang, or Java).
- Experience automating tasks with scripting languages such as Python, Go, and/or JavaScript.
- Infrastructure as code (IaC) for infrastructure automation, using CloudFormation and Terraform.
- Passion for automating everything repetitive.

