Written by Luuk Tersmette

Is your organization ready for SRE?

DevOps - 7 minute read

Site Reliability Engineering (SRE) is a new discipline that is quickly gaining traction within the IT landscape. First conceived by Google, it has since been adopted mainly by large enterprises such as Netflix, LinkedIn, Target, Tesla and many more. It is known for its focus on automating manual tasks without enduring value (toil), for treating operations as a software problem and for its unique approach to systems reliability. You may have heard of the advantages of SRE and want to introduce it to your organization. Perhaps you are a software engineer who wants to convince your peers to try it out, or a manager overseeing one or more teams who would like to introduce SRE to them. The question is: are you ready to start?

In this article I would like to give you some tips and tricks to judge how ready your organization is for this SRE journey and, if SRE has already been adopted, to determine the current level of SRE maturity.

DevOps adoption

As Google describes it, SRE is a superset of DevOps. Many of the SRE principles align with DevOps principles and build upon them. Good adoption of DevOps principles is therefore an important factor in successfully implementing SRE. If the DevOps maturity level is relatively low, you will likely benefit more from improving that first.

The automation hierarchy

Google describes a number of levels of automation, the higher levels (lower on the list) being more advanced:

  • No automation - all operational tasks are performed manually
  • Externally maintained system-specific automation - operational tasks are performed using external tooling, such as scripts written specifically for this one system
  • Externally maintained generic automation - there are generic systems that can perform tasks on multiple different systems
  • Internally maintained system-specific automation - systems include automated operational tasks out-of-the-box

So, what makes the higher levels better than the lower levels? Why is it better for a system to be automated internally than externally? If you rely on an external system to operate (for example, monitor) another system, you are now managing two systems. This usually increases the effort needed to maintain the process.

A system that is automated internally also has more options, as the automation can tap directly into all of the system's functionality and therefore has more fine-grained control. If the automation is system-specific as opposed to generic, the impact of any one automated task is narrower and the use of these tasks scales more linearly.
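The difference between externally and internally maintained automation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Database` class, its node names and both failover routines are hypothetical and do not come from any real product. The sketch shows the same failover task written once as a separate script and once as part of the system itself.

```python
class Database:
    """Hypothetical database with one primary and some standby nodes."""

    def __init__(self, nodes: list[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self.primary = self.nodes[0]
        self.down: set[str] = set()

    def healthy_nodes(self) -> list[str]:
        return [node for node in self.nodes if node not in self.down]


def external_failover(db: Database) -> None:
    """Externally maintained, system-specific automation (assumed example).

    This script lives outside the database, so it is a second thing to
    own and it has to be kept in sync with the database's internals.
    """
    if db.primary in db.down:
        candidates = db.healthy_nodes()
        if not candidates:
            raise RuntimeError("no healthy node to fail over to")
        db.primary = candidates[0]


class SelfHealingDatabase(Database):
    """Internally maintained automation: failover ships with the system."""

    def mark_down(self, node: str) -> None:
        self.down.add(node)
        # The system reacts itself; no external script has to notice.
        if node == self.primary:
            candidates = self.healthy_nodes()
            if not candidates:
                raise RuntimeError("no healthy node to fail over to")
            self.primary = candidates[0]
```

In the external variant someone has to run `external_failover(db)` and maintain it alongside the database. In the internal variant, calling `mark_down` on the primary is enough, because the failover logic has direct access to the system's own state.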

If you are low on this scale (less flexible automation, and less of it), there is more to be gained from automation, but some aspects of SRE might not fully come to fruition just yet. SRE works best if there is a foundation of automation upon which to build. Additionally, hiring SRE engineers is easier once automation is more mature, as their profile usually includes skills that match more mature forms of automation.

Types of SRE team implementations 

When it comes to SRE as a term, there are usually three definitions used.

  • As a role - the SRE engineer
  • As a team - the SRE team
  • As a set of practices and principles - the SRE discipline

In practice, these definitions are often intertwined and mixed depending on the context of discussions.

These definitions come together in a number of common forms of adoption. Which form fits depends on the structure of your organization, how much knowledge about SRE is present and whether some of its principles have already been adopted.

  • Consulting - SRE maturity is very low or non-existent. Consultants provide guidance and advice but do not actively participate in engineering activities.
  • Embedded - SRE maturity is at a low level. Consultants actively participate in engineering but also keep sharing knowledge and giving direction.
  • Platform - For medium SRE maturity. An SRE-minded platform team takes care of all supporting services and/or infrastructure for services. Knowledge about the delivered services is limited.
  • Slice and Dice - Again for medium SRE maturity. Services are divided into functional slices such as payment processing or the user-facing helpdesk. SRE engineers responsible for a slice have knowledge of the complete 'flow' of the service and can therefore better tailor automation to its needs.
  • Full SRE - High SRE maturity. The full scope of a particular service or set of services involves SRE engineers and all team members have an SRE mindset.

For all maturity levels there are gains to be made from adopting SRE, and the aforementioned forms cater to these varying needs. The common thread in these forms is that as the SRE maturity level progresses, the scope of SRE widens and more of the service delivery flow is imbued with SRE practices and principles. Choose the form that suits your maturity level to get the most out of SRE adoption.
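As a rough aid, the list above can be written down as a lookup from maturity level to the forms that suit it. This is a minimal sketch: the `Maturity` enum and the `suggested_forms` helper are names made up for illustration, and the four levels are a simplification of the descriptions above rather than an official scale.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """Assumed four-step scale, ordered from least to most mature."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Forms of adoption from the list above, keyed by the maturity level they suit.
FORMS_BY_MATURITY = {
    Maturity.NONE: ["Consulting"],
    Maturity.LOW: ["Embedded"],
    Maturity.MEDIUM: ["Platform", "Slice and Dice"],
    Maturity.HIGH: ["Full SRE"],
}


def suggested_forms(maturity: Maturity) -> list[str]:
    """Return the forms of adoption that match the given maturity level."""
    return list(FORMS_BY_MATURITY[maturity])


if __name__ == "__main__":
    for level in Maturity:
        print(level.name, "->", ", ".join(suggested_forms(level)))
```

Medium maturity maps to two forms; which of the two fits better depends on whether your organization is structured around shared infrastructure or around functional slices.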

Judging the SRE maturity level

So, now knowing that there are different forms to adopt based on the maturity level, how can you judge how high the SRE maturity level is? Assuming SRE is already implemented to some degree, there are some indicators.

  • There is executive buy-in on an SRE strategy. Investing in SRE means investing more in reliability than in velocity if reliability is under a certain threshold. For a lot of teams it means investing more in automation than in new features, in the hope that it pays off in the future. As executives usually have a say in high-level/abstract strategy, they have to agree to one which includes SRE for it to work.
  • SREs are involved early in the service design process and can exert influence. Designing a service to be reliable and observable, among other things, is easier in the earlier stages of the design process than after the fact.
  • There is a broad focus on automating repetitive tasks. If you can identify many operational tasks which still require manual labor and are not on the agenda to be addressed, there is likely still room for improvement in terms of SRE adoption.
  • Both current and future incidents are mitigated. Good SRE engineers practice anti-fragility: managing a service that is not only stable under normal operating conditions but can also recover easily when disaster strikes. Additionally, if monitoring is adequately implemented and operational tasks have a high degree of automation, solving incidents should be a lot easier.

The key takeaway here is to inspect how well SRE is integrated into the process: how effectively automation and incident response are handled, the degree to which SREs are involved in the design process and whether there is executive buy-in. In other words, is SRE implemented well and is there support within the organization?
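The first indicator speaks of investing in reliability over velocity once reliability drops under a certain threshold. One common way to make such a threshold explicit in SRE practice is an error budget derived from a service level objective (SLO). The sketch below assumes a simple request-based SLO; the function names and the example numbers are illustrative assumptions, not taken from a specific tool.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exactly used
    up, and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need 0 <= good_events <= total_events and total_events > 0")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


def prioritize_reliability(slo_target: float, good_events: int, total_events: int) -> bool:
    """Assumed policy: favor reliability work once the budget is used up."""
    return error_budget_remaining(slo_target, good_events, total_events) <= 0.0


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% availability SLO.
    print(error_budget_remaining(0.999, 999_500, 1_000_000))
    print(prioritize_reliability(0.999, 998_000, 1_000_000))
```

With a 99.9% target, roughly 1,000 failed requests per million are allowed. While failures stay below that, the team can keep shipping features; once they exceed it, the policy in this sketch shifts the investment to reliability, which is the kind of trade-off executives need to agree to up front.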

Progression in terms of reliability

Besides the SRE maturity level as a whole, there is also a way to judge progression in terms of handling reliability. Handling reliability effectively is a key component of any SRE strategy, and how it is dealt with is a good indicator of SRE maturity.

  • Reactive - Incidents are responded to after they happen. Lessons are rarely learned and investment in reducing the number of incidents is limited.
  • Proactive - Potential incidents are identified before they can occur and measures are taken to prevent them from happening.
  • Strategic - Efforts are made to include reliability as an aspect to be specifically addressed in architectures, products and processes.
  • Visionary - Reaching beyond the scope of the organization and getting involved in the broader community. For example: writing papers, giving advice to other organizations and making videos.

The saying "it is better to prevent than to cure" also applies to reliability, so being at the very least proactive about incidents is a good move to make. However, it is not necessary to be very high on the spectrum per se. Becoming a visionary on reliability requires major investments and the benefits might not outweigh the costs of doing so.

Summary

The key takeaway is that there are various ways to judge how your organization fares in terms of SRE maturity. Depending on the maturity level, there are different rates of progression and different approaches to advancing further.

A more 'junior' organization might focus on making automation more generic, hiring consultants to help determine SLOs and making its incident response more proactive.

A 'senior' organization will focus on developing tooling which comes with its own automation, broadening the scope of its existing SRE teams to include more of the service delivery flow and sharing knowledge about SRE outside the organization. Regardless of the maturity level there are gains to be had, but choosing the right approach for the right level will maximize them.

 

Related resources

https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum

https://cloud.google.com/blog/products/devops-sre/the-five-phases-of-organizational-reliability

https://sre.google/sre-book/automation-at-google/