SRE Challenges & Best Practices to Build Resilient Systems

Site Reliability Engineering (SRE) is an approach introduced in 2003 (pre-DevOps) by Google’s Benjamin Treynor Sloss, a software developer and leader of a new Ops team of 7 engineers. The team was tasked with making Google’s massive websites more efficient, scalable, and reliable. The approach was designed to apply proven software engineering principles to infrastructure and operations. It let developers focus on innovation while the operations team worked on consistency and reliability.

Eventually, many enterprises adopted the SRE approach after seeing how it helps build a scalable and reliable development process. During this period, SRE was instrumental in automating operations and in using business metrics and KPIs to measure the impact.

Today, the SRE team manages the performance, latency, and efficiency of the systems by:

Ensuring that transactions occur without error within the cut-off time
Automating issue detection and ensuring that defects are fixed on time
Improving collaboration between different teams by reducing silos
Reducing failure rates and downtime

SRE is often compared with DevOps because both rest on the same premise: better collaboration between teams, automation, and several other factors. However, DevOps focuses more on delivery, while SRE focuses on building system reliability.

To harness the potential of SRE, enterprises must understand both the potential challenges and best practices.

The Challenges of Site Reliability Engineering (SRE)

Imbalance in Operations and Development Tasks: According to Google’s benchmark, the ideal split in SRE teams is 50/50: no more than 50% of time on operational tasks such as ticket handling and calls, and the rest on development. However, the reality is often quite different. A survey revealed that 55% of respondents spent only 0-25% of their time on development. This imbalance leaves the team little time to develop applications and innovate.
Repetitive Tasks: One of the main expectations of SRE is the elimination of toil, meaning tasks that are manual and repetitive. According to Google, toil should take up no more than 50% of each SRE’s time, with the remainder going to engineering work that reduces future toil. However, if left unchecked, toil on some projects could consume up to 100% of the team’s time. For example, a pure development project with no calls would have 0% toil, while one with more operational work could require 80% toil. Therefore, managers must frequently measure the time spent on toil to ensure the load is spread evenly across the team.
Poor Incident Management: A postmortem of every incident is important as it helps the team avoid the same errors in the future. However, enterprises don’t always take these postmortems seriously. When the process is unstructured, it is often ineffective; lessons are not learned and mistakes are repeated. Another challenge is that incident management is not proactive: the SRE team waits for an incident to occur before responding. It’s also not unusual for new team members to receive no proper incident response training. A successful SRE implementation must provide training and have a comprehensive incident management mechanism to take pre-emptive measures before an incident occurs.
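To make the toil measurement described above more concrete, here is a minimal Python sketch. It assumes a 50% ceiling on toil per engineer and a simple weekly time log; the `TimeLog` class, the function names, and the sample hours are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # assumed ceiling: toil should not exceed 50% of an engineer's time


@dataclass
class TimeLog:
    """Hypothetical weekly time log for one engineer."""
    engineer: str
    toil_hours: float
    engineering_hours: float

    @property
    def toil_share(self) -> float:
        total = self.toil_hours + self.engineering_hours
        return self.toil_hours / total if total else 0.0


def over_cap(logs, cap=TOIL_CAP):
    """Return the names of engineers whose toil share exceeds the cap."""
    return [log.engineer for log in logs if log.toil_share > cap]


def team_toil_share(logs):
    """Return toil as a fraction of all hours logged by the team."""
    toil = sum(log.toil_hours for log in logs)
    total = sum(log.toil_hours + log.engineering_hours for log in logs)
    return toil / total if total else 0.0


# Illustrative data only, echoing the 0% and 80% cases mentioned above.
logs = [
    TimeLog("asha", toil_hours=32, engineering_hours=8),   # 80% toil
    TimeLog("ben", toil_hours=0, engineering_hours=40),    # 0% toil
    TimeLog("chen", toil_hours=16, engineering_hours=24),  # 40% toil
]

print(over_cap(logs))         # engineers above the cap
print(team_toil_share(logs))  # team-wide toil fraction
```

In this sample the team average sits below the cap while one engineer is well above it, which is why the per-engineer view matters when checking whether the toil load is spread evenly.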

Best Practices of Site Reliability Engineering (SRE)


Improve Incident Management: To prevent incidents or resolve them quickly, enterprises must establish a proper SRE incident management process. This will help the SRE teams identify, log, and categorize incidents based on urgency and impact, then prioritize them accordingly. Once the incident is resolved, the team can close it and update the status. In an ideal situation, the SRE team will also conduct a postmortem of the incident to identify areas of improvement and help ensure a resilient system and effective incident management.
Include Embedded SREs in Operations: Sometimes, the volume of daily tickets is so high that operational tasks consume most of the operations team’s time, slowing progress. Embedding SREs into operations can solve this issue. The embedded SRE observes the team’s daily tasks, recommends ways to improve the process and its outcome, and provides engineering best practices to maintain reliability and scalability throughout the project lifecycle.
Tailor the Monitoring of Monolithic vs. Distributed Environments: Monitoring the reliability of a monolithic environment is different from monitoring a distributed one. For example, a monolithic application typically stores all its logs and metrics in a single log file, while a distributed application has several data sources. On the other hand, some risks are lower with a distributed application; for example, it is easier and faster to redeploy than a monolith. To be effective, the SRE team should tailor its monitoring approach accordingly.
Apply Exhaustive Rules in a Managed Cloud Environment: As more critical applications move to the cloud, enterprises must ensure that the systems are resilient. Managing the cloud environment efficiently requires the SRE team to have an extensive set of rules. These rules should alert the team during an incident and surface previously undetected conditions that are urgent and actionable. With a clear picture of an impending failure, the SRE team can take appropriate action in near real time.
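As an illustration of the incident management practice above (categorize incidents by urgency and impact, then prioritize them), the following Python sketch shows one possible triage step. The three-level scale, the priority formula, and every name in it are assumptions made for this example, not a prescribed SRE standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale used for both urgency and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    """Hypothetical incident record."""
    title: str
    urgency: Level
    impact: Level
    status: str = "open"

    @property
    def priority(self) -> int:
        # Lower number means handle sooner:
        # 1 when both levels are HIGH, 5 when both are LOW.
        return 7 - (self.urgency + self.impact)


def triage(incidents):
    """Return open incidents ordered from highest to lowest priority."""
    open_incidents = [i for i in incidents if i.status == "open"]
    return sorted(open_incidents, key=lambda i: i.priority)


# Illustrative incidents only.
incidents = [
    Incident("Slow report export", Level.LOW, Level.MEDIUM),
    Incident("Checkout failing", Level.HIGH, Level.HIGH),
    Incident("Stale cache on status page", Level.MEDIUM, Level.LOW, status="closed"),
    Incident("Login latency spike", Level.HIGH, Level.MEDIUM),
]

for incident in triage(incidents):
    print(f"P{incident.priority}: {incident.title}")
```

Closed incidents drop out of the queue, which mirrors the close-and-update step; a real process would also record the postmortem outcome against each closed incident.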

Conclusion

COVID-19 has clearly demonstrated the need for enterprises to build resilient systems to ensure business continuity. This has increased the applicability of SRE, with adoption growing by 21% in 2021 compared to 15% in 2020. Digital tech leaders like Netflix have invested heavily in developing a core SRE team to identify potential risks and proactively respond to them.

SRE has the potential to change how IT operations function and how products are built and released. The onus lies with enterprises to prioritize SRE adoption and harness its potential this year.

At Xoriant, we address site reliability so enterprises can focus on growing their business.
