SRE Challenges & Best Practices to Build Resilient Systems
Site Reliability Engineering (SRE) is an approach introduced in 2003 (pre-DevOps) by Google’s Benjamin Treynor Sloss, a software developer who led a new Ops team of seven engineers. The team was tasked with making Google’s massive websites more efficient, scalable, and reliable. The approach applies proven software engineering principles to infrastructure and operations. It lets developers focus on innovation while the operations team works on consistency and reliability.
Eventually, many enterprises adopted the SRE approach after seeing how it helps build a scalable and reliable development process. During this period, SRE was instrumental in automating operations and in using business metrics and KPIs to measure the impact.
Today, the SRE team manages the performance, latency, and efficiency of the systems by:
- Ensuring that transactions occur without error within the cut-off time
- Automating issue detection and ensuring that defects are fixed on time
- Improving collaboration between different teams by reducing silos
- Reducing failure rates and downtime
SRE is often compared with DevOps because both rest on the same premise: better collaboration between teams, more automation, and several other shared factors. However, DevOps focuses more on delivery, while SRE focuses on system reliability.
To harness the potential of SRE, enterprises must understand both the potential challenges and best practices.
The Challenges of Site Reliability Engineering (SRE)
- Imbalance in Operations and Development Tasks: According to Google’s benchmark, the ideal split in SRE teams is 50/50: at most 50% of time on operational tasks such as ticket handling, calls, etc., and at least 50% on development. However, the reality is often quite different. A survey revealed that 55% of respondents spent only 0-25% of their time on development. This imbalance leaves the team little time to develop applications and innovate.
- Repetitive Tasks: One of the main expectations of SRE is the elimination of toil, meaning tasks that are manual and repetitive. According to Google, toil should take up no more than 50% of an SRE team’s time, with the rest spent on engineering work that reduces future toil. However, if left unchecked, toil can consume up to 100% of the team’s time. For example, a pure development project with no calls would have 0% toil, while one with more operational work could involve 80% toil. Therefore, managers must frequently measure the time spent on toil to keep the overall load in check and spread it evenly across the team.
- Poor Incident Management: A postmortem of every incident is important because it helps the team avoid the same errors in the future. However, enterprises don’t always take postmortems seriously. When the process is unstructured, it is often ineffective: lessons are not learned and mistakes are repeated. Another challenge is that incident management is often reactive, with the SRE team waiting for an incident to occur before responding. New team members also frequently receive no proper incident response training. A successful SRE implementation must provide that training and have a comprehensive incident management mechanism that allows pre-emptive measures before an incident occurs.
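The toil measurement mentioned under Repetitive Tasks can be made concrete with a small calculation. The following is a minimal sketch, not a prescribed method: it assumes a hypothetical time log of (engineer, category, hours) entries where the category is either "toil" or "dev", and it uses the 50% figure cited above as an assumed threshold. The names `toil_fractions`, `over_cap`, and `TOIL_CAP` are invented for illustration.

```python
from collections import defaultdict

# Assumed threshold, following the 50% guideline cited in the text.
TOIL_CAP = 0.5


def toil_fractions(entries):
    """Return {engineer: fraction of logged hours spent on toil}.

    `entries` is an iterable of (engineer, category, hours) tuples;
    the category "toil" marks manual, repetitive work (hypothetical format).
    """
    toil = defaultdict(float)
    total = defaultdict(float)
    for engineer, category, hours in entries:
        total[engineer] += hours
        if category == "toil":
            toil[engineer] += hours
    # Skip engineers with no logged hours to avoid dividing by zero.
    return {e: toil[e] / total[e] for e in total if total[e] > 0}


def over_cap(fractions, cap=TOIL_CAP):
    """Return the engineers whose toil fraction exceeds the cap, sorted by name."""
    return sorted(e for e, f in fractions.items() if f > cap)


# Hypothetical week: one engineer on an operations-heavy project,
# one on a pure development project.
week = [
    ("ana", "toil", 32.0),
    ("ana", "dev", 8.0),
    ("ben", "dev", 40.0),
]
fractions = toil_fractions(week)
flagged = over_cap(fractions)
```

Reviewing the per-engineer fractions on a regular schedule, rather than only the team average, is what makes it possible to see whether the toil load is spread evenly.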
Best Practices of Site Reliability Engineering (SRE)
- Improve Incident Management: To prevent incidents or resolve them quickly, enterprises must establish a proper SRE incident management process. This will help the SRE teams identify, log, and categorize incidents based on urgency and impact, then prioritize them accordingly. Once the incident is resolved, the team can close it and update the status. In an ideal situation, the SRE team will also conduct a postmortem of the incident to identify areas of improvement and help ensure a resilient system and effective incident management.
- Include Embedded SREs in Operations: Sometimes, the volume of daily tickets is so high that operational tasks consume most of the operations team’s time, slowing progress. Embedding SREs into operations can solve this issue. The embedded SRE observes the team’s daily tasks, recommends ways to improve the process and its outcomes, and provides engineering best practices to maintain reliability and scalability throughout the project lifecycle.
- Tailor the Monitoring of Monolithic vs. Distributed Environments: Monitoring the reliability of a monolithic environment is different from monitoring a distributed one. For example, a monolithic application typically keeps its logs and metrics in one place, while a distributed application spreads them across different data sources. However, a distributed application can carry lower risk in other respects; for example, it is easier and faster to redeploy than a monolithic application. To be effective, the SRE team should tailor its monitoring approach accordingly.
- Apply Exhaustive Rules in a Managed Cloud Environment: As more critical applications move to the cloud, enterprises must ensure that the systems are resilient. Managing the cloud environment efficiently requires an extensive set of rules. These rules should alert the team during an incident and surface conditions that would otherwise go undetected but are urgent and actionable. With a clear picture of an impending failure, the SRE team can take appropriate action in near real time.
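Two of the practices above, rules that surface urgent and actionable conditions and incidents categorized by urgency and impact, can be sketched together. This is an illustrative sketch under stated assumptions, not a real monitoring product's API: the `Rule` and `Incident` types, the 1 to 3 scoring scale, the priority formula (urgency times impact), the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # returns True when the rule fires
    urgency: int      # 1 (low) to 3 (high); assumed scale
    impact: int       # 1 (low) to 3 (high); assumed scale
    actionable: bool  # whether a responder can act on the condition


@dataclass(frozen=True)
class Incident:
    rule: str
    priority: int  # urgency * impact; a higher value is handled first


def evaluate(rules: List[Rule], metrics: Metrics) -> List[Incident]:
    """Log an incident for each firing, actionable rule, highest priority first."""
    incidents = [
        Incident(rule=r.name, priority=r.urgency * r.impact)
        for r in rules
        if r.actionable and r.condition(metrics)
    ]
    return sorted(incidents, key=lambda i: i.priority, reverse=True)


# Hypothetical rule set; metric names and thresholds are made up.
RULES = [
    Rule("disk_filling", lambda m: m["disk_used"] > 0.90, 2, 2, True),
    Rule("error_rate_high", lambda m: m["error_rate"] > 0.05, 3, 3, True),
    Rule("cpu_blip", lambda m: m["cpu"] > 0.95, 1, 1, False),  # fires, but nobody is paged
]

snapshot = {"error_rate": 0.08, "disk_used": 0.93, "cpu": 0.99}
queue = evaluate(RULES, snapshot)
healthy = evaluate(RULES, {"error_rate": 0.0, "disk_used": 0.10, "cpu": 0.20})
```

Filtering on `actionable` before alerting is one simple way to keep the rule set extensive without paging the team for conditions nobody can act on.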
COVID-19 has clearly demonstrated the need for enterprises to build resilient systems to ensure business continuity. This has increased the applicability of SRE, with adoption growing by 21% in 2021 compared to 15% in 2020. Digital tech leaders like Netflix have invested heavily in developing a core SRE team to identify potential risks and proactively respond to them.
SRE has the potential to change how IT operations function and how products are built and released. The onus is on enterprises to prioritize SRE adoption and harness its potential this year.
At Xoriant, we address site reliability so enterprises can focus on growing their business.