Site Reliability Engineering (SRE) is an approach introduced at Google in 2003, before DevOps emerged, by Benjamin Treynor Sloss, a software developer who led a new Ops team of seven engineers. The team was tasked with making Google’s massive websites more efficient, scalable, and reliable. The approach applies proven software engineering principles to infrastructure and operations. It lets developers focus on innovation while the operations team works on consistency and reliability.

Eventually, many enterprises adopted the SRE approach once they saw how it helps build a scalable and reliable development process. During this period, SRE was instrumental in automating operations and in using business metrics and KPIs to measure the impact.

Today, the SRE team manages the performance, latency, and efficiency of the systems by:

  • Ensuring that transactions occur without error within the cut-off time
  • Automating issue detection and ensuring that defects are fixed on time
  • Improving collaboration between different teams by reducing silos
  • Reducing failure rates and downtime
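
The first bullet, transactions that complete without error within the cut-off time, can be turned into a single number a team tracks. The sketch below is only an illustration: the `Transaction` record, the `good_fraction` helper, and the 300 ms cut-off are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    ok: bool           # True if the transaction completed without error
    latency_ms: float  # how long the transaction took, in milliseconds


def good_fraction(transactions, cutoff_ms):
    """Return the share of transactions that succeeded within the cut-off."""
    if not transactions:
        # Assumption: with no traffic, nothing has failed.
        return 1.0
    good = sum(1 for t in transactions if t.ok and t.latency_ms <= cutoff_ms)
    return good / len(transactions)


# Hypothetical sample data for illustration only.
sample = [
    Transaction(ok=True, latency_ms=120.0),
    Transaction(ok=True, latency_ms=480.0),   # succeeded, but too slow
    Transaction(ok=False, latency_ms=90.0),   # fast, but failed
    Transaction(ok=True, latency_ms=250.0),
]

print(good_fraction(sample, cutoff_ms=300.0))
```

A falling value of this ratio over time is one possible trigger for the automated issue detection mentioned in the second bullet.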

SRE is often compared with DevOps because both share the same premise: better collaboration between teams, automation, and several other factors. However, DevOps focuses more on delivery, while SRE focuses on building system reliability.

To harness the potential of SRE, enterprises must understand both the potential challenges and best practices.

The Challenges of Site Reliability Engineering (SRE)

  1. Imbalance in Operations and Development Tasks: According to Google’s benchmark, the ideal split in SRE teams is 50/50: 50% on operational tasks such as ticket handling and calls, and 50% on development. In practice, the split is often quite different. One survey found that 55% of respondents spent only 0-25% of their time on development. This imbalance keeps the team from developing applications and innovating.
     
  2. Repetitive Tasks: One of the main expectations of SRE is the elimination of toil, that is, tasks that are manual and repetitive. Google’s guidance is that toil should take up no more than 50% of each SRE’s time. However, if left unchecked, toil on some projects could consume up to 100% of the team’s time. The load also varies by project: a pure development project with no calls would have 0% toil, while one with more operational work could require 80% toil. Therefore, managers must frequently measure the time spent on toil to ensure the load is spread evenly across the team.
     
  3. Poor Incident Management: A postmortem of every incident is important because it helps the team avoid the same errors in the future. However, enterprises don’t always take postmortems seriously. When the process is unstructured, it is often ineffective: lessons are not learned and mistakes are repeated. Another challenge is that incident management is often reactive, with the SRE team waiting for an incident to occur before responding. It is also not unusual for new members to receive no proper incident response training. A successful SRE implementation must provide training and a comprehensive incident management mechanism so the team can take pre-emptive measures before an incident occurs.
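
Measuring toil, as the second challenge recommends, can start with simple time logs. The following is a minimal sketch under stated assumptions: the log format (engineer, hours, toil flag), the helper names `toil_share` and `over_cap`, and the sample data are all hypothetical. The default cap of 0.5 mirrors the 50% figure cited above.

```python
from collections import defaultdict


def toil_share(entries):
    """Compute each engineer's toil fraction.

    entries: iterable of (engineer, hours, is_toil) tuples.
    Returns a dict mapping engineer -> toil hours / total hours.
    """
    total = defaultdict(float)
    toil = defaultdict(float)
    for engineer, hours, is_toil in entries:
        total[engineer] += hours
        if is_toil:
            toil[engineer] += hours
    return {e: toil[e] / total[e] for e in total if total[e] > 0}


def over_cap(shares, cap=0.5):
    """Return the engineers whose toil share exceeds the cap, sorted by name."""
    return sorted(e for e, share in shares.items() if share > cap)


# Hypothetical one-week time log for two engineers.
log = [
    ("asha", 30.0, True),   # ticket handling
    ("asha", 10.0, False),  # development work
    ("ben", 8.0, True),
    ("ben", 32.0, False),
]

shares = toil_share(log)
print(shares)
print(over_cap(shares))
```

In this sample, one engineer carries most of the toil while the other carries little, which is the kind of uneven spread a manager would want to catch and rebalance.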

Best Practices of Site Reliability Engineering (SRE)

  1. Improve Incident Management: To prevent incidents or resolve them quickly, enterprises must establish a proper SRE incident management process. This will help the SRE teams identify, log, and categorize incidents based on urgency and impact, then prioritize them accordingly. Once the incident is resolved, the team can close it and update the status. In an ideal situation, the SRE team will also conduct a postmortem of the incident to identify areas of improvement and help ensure a resilient system and effective incident management.
     
  2. Include Embedded SREs in Operations: Sometimes, the volume of daily tickets is so high that operational tasks consume most of the team’s time, slowing progress. Embedding SREs into operations can solve this issue. The embedded SRE observes the team’s daily tasks, recommends solutions that improve both the process and the outcome, and provides engineering best practices to maintain reliability and scalability throughout the project lifecycle.
     
  3. Tailor the Monitoring of Monolithic vs. Distributed Environments: Monitoring the reliability of a monolithic environment is different from monitoring a distributed one. For example, a monolithic application typically keeps its logs and metrics in one place, while a distributed application spreads them across different data sources. On the other hand, some risks are lower with a distributed application; for example, it is easier and faster to redeploy part of a distributed application than an entire monolith. To be effective, the SRE team should tailor its monitoring approach accordingly.
     
  4. Apply Exhaustive Rules in a Managed Cloud Environment: As more critical applications move to the cloud, enterprises must ensure that the systems are resilient. SRE teams need an extensive set of rules to manage the cloud environment efficiently. These rules should alert the team during an incident and surface previously undetected conditions that are urgent and actionable. With a clear picture of an impending failure, the SRE team can take appropriate action in near real time.
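
The first practice above describes categorizing incidents by urgency and impact, prioritizing them, and closing them once resolved. One way to sketch that flow is shown below. The three-level scale, the urgency-times-impact score, and the names `Level`, `Incident`, `priority`, and `triage` are illustrative assumptions rather than a prescribed scheme.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    title: str
    urgency: Level
    impact: Level
    status: str = "open"  # set to "closed" once the incident is resolved


def priority(incident):
    """Score an incident; a higher score means it should be handled sooner."""
    return int(incident.urgency) * int(incident.impact)


def triage(incidents):
    """Return open incidents ordered from highest to lowest priority."""
    open_incidents = [i for i in incidents if i.status == "open"]
    return sorted(open_incidents, key=priority, reverse=True)


# Hypothetical backlog for illustration only.
backlog = [
    Incident("Slow dashboard", Level.LOW, Level.MEDIUM),
    Incident("Checkout errors", Level.HIGH, Level.HIGH),
    Incident("Stale cache", Level.MEDIUM, Level.LOW, status="closed"),
]

for incident in triage(backlog):
    print(priority(incident), incident.title)
```

Multiplying urgency by impact is only one possible scoring choice. A team might instead use a lookup matrix, or weight impact more heavily than urgency.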

Conclusion

COVID-19 has clearly demonstrated the need for enterprises to build resilient systems to ensure business continuity. This has increased the applicability of SRE, with adoption growing by 21% in 2021 compared to 15% in 2020. Digital tech leaders like Netflix have invested heavily in developing a core SRE team to identify potential risks and proactively respond to them. 

SRE has the potential to change how IT operations function and how products are built and released. The onus lies with enterprises to prioritize SRE adoption and harness its potential this year.

At Xoriant, we address site reliability so enterprises can focus on growing their business.

Global Presence
Across Americas, Europe, and Asia
16 Locations

Asia (8 Locations)

Singapore
70 Shenton Way,
#13-03,
Eon Shenton,
Singapore 079118

Gurugram
5th Floor, Tower B,
Golf View Corporate Towers,
Sector 42, Golf Course Road,
Gurugram 122002

Hyderabad
5th Floor, Smartworks, Block 3, DLF Cybercity, Survey No. 129 to 132,
Gachibowli Village, Serilingampally, (M) Ranga Reddy District,
Hyderabad, Telangana 500032

Bengaluru
3rd Floor, Karle Town, Building No. 5
Nagavara Village Kasaba Hobli,
Bangalore North,
Bengaluru, Karnataka 560045

Chennai
8th Floor, Smartworks,
Olympia National Tower
Block 3, A3 and A4, North Phase,
Guindy Industrial Estate, Chennai 600032

Pune
Smartworks 43 EQ, 14th-15th Floor,
Sai Chowk Road,
Opposite Bharati Vidyapeeth School,
Laxman Nagar, Balewadi Pune,
Maharashtra 411045

Mumbai - Thane
8th Floor, 315 Work Avenue,
Ekatva Olethia Building,
Opposite Ashar IT Main Gate,
Wagle Industrial Estate,
Thane West, 400604

Mumbai
7th Floor, Redbrick,
Oberoi Commerz-1
Oberoi Garden City,
Goregaon East 400063

Europe (2 Locations)

Ireland
Grove, Fethard,
Co. Tipperary,
E91 E282, Dublin, Ireland

London
c/o SPACES,
12 Hammersmith Grove,
London W6 7AP, UK

North America (6 Locations)

Canada
55 York Street, Suite 401
Toronto, ON,
Canada M5J 1R7

Mexico
Tomas A. Edison 1510-201
Ciudad Juárez,
Chihuahua, Mexico 32300

Dallas
5800 Granite Parkway,
Suite 480
Plano, TX, 75024

Troy
6915 Rochester Road
Suite 300
Troy, MI 48085

Sunnyvale
1248 Reamwood Avenue
Sunnyvale, CA 94089

New Jersey
343 Thornall Street
Suite 720
Edison, NJ 08837