Written by Luuk Tersmette

Challenges of implementing SRE in small to medium businesses

DevOps · 4 min read

Site Reliability Engineering (SRE) is a relatively new discipline that is quickly gaining traction in the IT landscape. First conceived by Google, it has since been adopted mainly by large enterprises such as Netflix, LinkedIn, Target and Tesla. It is known for its focus on automating manual tasks that have no enduring value (toil), for treating operations as a software problem and for its distinctive approach to systems reliability.

Differences in SRE practices

SRE is a multi-faceted discipline, and there are many ways to implement it. You can create Service Level Objectives (SLOs) to define the level of service you deliver to users, reduce toil with targeted measures, define a risk tolerance for a service, and practice release engineering to make releases streamlined and consistent.

These, amongst some other practices, primarily dictate a certain way of working. However, some practices are more focused on designing systems to be reliable, highly available and robust. In other words, I see two categories:

  1. Practices which are primarily ways of working. They also have technical implications, but for the most part they concern how we interact as colleagues and how we prioritize what we work on.
  2. Practices which are primarily ways of designing systems. The key aspect of these is designing an application around qualities such as reliability, availability and robustness, and then building it so that it meets those goals.

Advantages of scale

Not every practice is equally easy to apply in a given business. Many of the practices Google describes in its SRE Handbook stem from dealing with its own growing pains. As the organization grew, it needed cleverer solutions to keep delivering a good service while using resources effectively. Small to medium businesses, however, do not face some of these challenges and use their resources differently. Some examples:

  • They often use off-the-shelf software as opposed to tailor-made solutions
  • Teams are more converged. They often operate within one region and have a smaller selection of services to deliver
  • They have less variety of roles in a team
  • They have fewer resources at their disposal
  • Very highly skilled people tend to flock to large prestigious enterprises as opposed to smaller companies

Implications on SRE implementation

Given the above differences, the way small to medium businesses can implement SRE differs. For example:

  • When using off-the-shelf software, everything comes as is. Some SRE practices dictate how to design and build applications according to certain standards, but that is not something you do in this case
  • Even when software is written for systems administration, it is likely either a smaller application or a script, as opposed to larger complicated systems
  • Some of these system design principles can be complicated to grasp and to implement, such as distributed consensus and distributed periodic scheduling. The employees working at the company might not have the skills necessary to work with these principles
  • These kinds of businesses might not scale as fast, so investments in a high degree of scalability might not pay off for them

Working around limitations

Even though mostly off-the-shelf software is used, you can still apply a lot of the principles in SRE. Instead of using them to design your own software, try to understand the principles and buy software which is built with these principles in mind. For example, you can buy software which has high availability as a feature. 

Having the ability to create your own software gives you more opportunities to reduce toil, as you have more control over the way your infrastructure is managed. However, you can also do a lot with scripts, pipelines or small applications. A routine clean-up in an external system, for example, could easily be encapsulated in a Python script run by a cron job.
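As an illustration of how small such a clean-up script can be, here is a minimal sketch. It assumes the "external system" is simply a directory of files that should be removed after a retention period; the function names `find_stale_files` and `cleanup` and the retention logic are hypothetical and not taken from any particular tool.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_files(root: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Return files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def cleanup(root: Path, max_age_days: float, dry_run: bool = True) -> List[Path]:
    """Delete stale files and return them.

    With dry_run=True (the default) nothing is deleted; the function only
    reports what it would remove, which is a safer default for automation.
    """
    stale = find_stale_files(root, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

A small entry point that calls `cleanup(...)` could then be scheduled with a crontab line such as `0 3 * * * python3 /path/to/cleanup.py` (the path is a placeholder). Defaulting to a dry run lets you review what would be deleted before trusting the job to run unattended.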

Many principles can be implemented with off-the-shelf software and widely available documentation. Service Level Objectives can be monitored in many monitoring solutions such as Datadog and Splunk. On-call rotations for engineers can likewise be managed effectively with tools such as VictorOps.
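Monitoring products do this calculation for you, but the arithmetic behind an availability objective is simple enough to sketch. The example below assumes a request-based objective (successful requests divided by total requests); the names `SloReport` and `evaluate_slo` are hypothetical and do not correspond to any vendor's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float      # fraction of successful requests, 0.0 to 1.0
    failures_allowed: int    # failures the target tolerates at this volume
    failures_remaining: int  # negative once the target has been missed
    objective_met: bool


def evaluate_slo(total_requests: int, failed_requests: int,
                 target: float) -> SloReport:
    """Compare observed request failures against an availability target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    availability = (total_requests - failed_requests) / total_requests
    # The small epsilon guards against floating-point results such as
    # 99.9999999 being floored to 99 instead of 100.
    failures_allowed = math.floor(total_requests * (1.0 - target) + 1e-9)
    failures_remaining = failures_allowed - failed_requests
    return SloReport(
        availability=availability,
        failures_allowed=failures_allowed,
        failures_remaining=failures_remaining,
        objective_met=failed_requests <= failures_allowed,
    )
```

For instance, `evaluate_slo(100_000, 40, 0.999)` reports that 100 failures are tolerated and 60 remain, so the objective is met. The number of remaining failures is a practical way to express the risk tolerance of a service: as long as it is positive, there is room for riskier changes.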

Your business might not benefit from all the scale advantages of companies like Google, but investments in scalability can still pay off. Using container orchestration for workloads means that applications are scaled in a consistent and controlled manner. Cloud providers offer comparable features for VMs. Having good control over scaling also helps during maintenance, as you can temporarily take a node out of service, perform the maintenance and bring it back.

In conclusion

Google describes a lot of principles in its SRE Handbook. The ideas underlying them are universally applicable, but how far they can be implemented, and in what way, is likely to differ considerably between smaller businesses and large enterprises such as Google. Still, by choosing software which adheres to these principles and provides features that support SRE, by using smaller pieces of software for automation and by implementing practices to the appropriate degree, your business can also benefit from SRE.