Users expect issues and outages to be resolved as quickly as possible, so round-the-clock on-call rotations have become essential for SRE and DevOps teams. How can you design equitable on-call rotation plans that reduce employee burnout?
An on-call rotation is a schedule that alternates among engineers so that someone is always available to respond to problems right away, either by resolving the issue or by escalating it to a team that can. On-call engineers are tasked with watching the system and responding to incidents that could affect users at any hour of the day or night. Business continuity and the overall customer experience depend on it.
An on-call rotation fulfils several essential functions for the company, not least limiting costly downtime. Businesses of all kinds in North America alone are predicted to lose $700 billion to downtime.
However, poorly thought-out on-call rotations can lead to fatigue and burnout. Although being available when needed is an essential part of the job, organisations and teams can work together to create a fair, low-stress environment for their members.
Sysadmins and operations engineers handle the majority of on-call rotations. Teams that follow a DevOps culture and a site reliability engineering (SRE) methodology may also have site reliability engineers on call.
In an SRE culture, SREs and on-call engineers are largely responsible for monitoring and resolving issues. While on call, they have to watch for problems and react to any alerts that are raised. On-call duties can include resolving server problems, fixing broken code, and handling other issues that could have a big impact on users and the business.
How do you create an on-call schedule? Examples of on-call schedules include the following.
Once you've decided that your team needs on-call coverage, there are several ways to build the schedule and many rotation options to choose from.
In teams of five or fewer, it can be hard to design an on-call schedule that prevents alert fatigue or burnout, because each engineer must frequently work long shifts to cover the whole period. A weekly rotation or an alternating-day schedule is generally the best approach.
For a small team, consider setting up third-party monitoring as an on-call backup wherever feasible. To keep workloads balanced, you can also set up a rotating backup schedule for three or more people.
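As a rough illustration of a weekly rotation with a rotating backup, here is a minimal Python sketch. The engineer names, the start date, and the rule that next week's primary acts as this week's backup are all assumptions made for the example, not a prescribed method.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, primary, backup) for a simple round-robin rotation."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary plus a backup")
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the backup is the next person in line,
        # so they become the primary the following week.
        backup = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, backup

# Hypothetical three-person team starting on Monday 1 January 2024.
schedule = list(weekly_rotation(["ana", "ben", "chen"], date(2024, 1, 1), 4))
for week_start, primary, backup in schedule:
    print(week_start.isoformat(), primary, backup)
```

With three people, each engineer is primary one week in three and backup one week in three, which leaves one week fully off in every cycle.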
The "follow the sun" model is a well-known example of an on-call rotation plan. If your team is large enough to span multiple time zones, you can distribute the schedule by location. That way, coverage is maintained around the clock while each team member takes on-call duty during their own daytime rather than at night.
Even if the team isn't large enough to support this kind of scheduling, it's worth looking for additional support in other time zones to cover some or all night shifts.
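To make the follow-the-sun idea concrete, the sketch below picks the region that should be on call for a given moment. The three regions and their eight-hour UTC windows are hypothetical; a real team would choose handoff times that match where its people actually work.

```python
from datetime import datetime, timezone

# Hypothetical regions, each covering an 8-hour window given as a UTC start hour.
HANDOFFS = [
    (0, "apac"),   # 00:00-07:59 UTC
    (8, "emea"),   # 08:00-15:59 UTC
    (16, "amer"),  # 16:00-23:59 UTC
]

def region_on_call(moment):
    """Return the region whose window contains a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    current = HANDOFFS[0][1]
    # HANDOFFS is sorted by start hour, so the last window that has
    # already started is the one currently on call.
    for start_hour, region in HANDOFFS:
        if hour >= start_hour:
            current = region
    return current

print(region_on_call(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)))
```

Working in UTC keeps the handoff table independent of any one office's local time and of daylight-saving changes.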
It's important to take service ownership into account when developing on-call schedules. If one on-call engineer is accountable for every service, that can substantially limit how much they can get done, because they may not have the tools and expertise each service requires.
To minimise interruptions, you can divide on-call responsibilities among teams and services. With tools such as automated runbooks, you can automate as much as possible, standardise how issues are handled, and reduce the strain on on-call staff.
How often the team rotates will depend on its workload and size, but there are a few options. You could use a semi-monthly calendar, for instance, in which team members alternate based on specified tasks.
Another option is a week/weekend plan in which team members cover the weekdays, then the weekend, before switching. Ad hoc schedules and rotations can also be created on demand, but they are often difficult to maintain and impractical in the long run.
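One possible reading of the week/weekend plan is sketched below: one engineer covers Monday to Friday, the next covers Saturday and Sunday, and the weekend owner then takes the following week's weekdays. The function name, the names in the example, and this particular switching rule are assumptions for illustration only.

```python
from datetime import date, timedelta

def week_weekend_plan(engineers, first_monday, weeks):
    """Return (start, end, engineer) shifts: Mon-Fri, then Sat-Sun, per week."""
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    shifts = []
    for week in range(weeks):
        monday = first_monday + timedelta(weeks=week)
        weekday_owner = engineers[week % len(engineers)]
        # Assumption: the weekend goes to the next engineer in line,
        # who then covers the following week's weekdays.
        weekend_owner = engineers[(week + 1) % len(engineers)]
        shifts.append((monday, monday + timedelta(days=4), weekday_owner))
        shifts.append(
            (monday + timedelta(days=5), monday + timedelta(days=6), weekend_owner)
        )
    return shifts

# Hypothetical two-person team; 1 January 2024 is a Monday.
for start, end, engineer in week_weekend_plan(["ana", "ben"], date(2024, 1, 1), 2):
    print(start.isoformat(), end.isoformat(), engineer)
```

Under this rule the weekend owner rolls straight into a weekday shift, so a larger team may prefer to offset the two sequences to leave a gap between them.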
An on-call schedule can take many forms, depending on the requirements and size of the team. If you are unsure how to set up your rotation, the following best practices can help you create one that is tailored to your needs.
Communicate early and often.
Before adopting any model, it's critical to work with the team to understand their goals.
Forcing people to be on call without any input into the schedule will ultimately hurt productivity and well-being. By including everyone in the process, you maintain transparency and give everyone the opportunity to contribute. For instance, some employees may prefer to work later than others, which could affect scheduling. They may also recommend rotation plans, such as one week on/one week off, based on workloads and expectations.
Alert fatigue occurs when engineers receive so many notifications that they can no longer prioritise and respond appropriately. Determine which alerts the on-call team members really need to receive and which ones can be automated. Teams need to assess which incidents require immediate attention and which can wait until the next day.
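The triage described above can be expressed as a small routing rule. The sketch below is only one way to draw the line: the severity labels, the `user_facing` and `auto_remediable` fields, and the three outcomes are assumptions for the example, and each team will want its own criteria.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str              # assumed labels: "critical", "warning", "info"
    user_facing: bool
    auto_remediable: bool = False

def route(alert):
    """Return 'page', 'automate', or 'ticket' for an incoming alert."""
    if alert.severity == "critical" and alert.user_facing:
        return "page"          # needs a human right now
    if alert.auto_remediable:
        return "automate"      # handled by a script, nobody is woken up
    return "ticket"            # can wait until the next working day

print(route(Alert("checkout-5xx", "critical", True)))
print(route(Alert("disk-cleanup", "warning", False, auto_remediable=True)))
print(route(Alert("cert-expiry-30d", "info", False)))
```

Writing the rule down, even this crudely, gives the team something concrete to review when an alert pages someone who did not need to be woken.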
When team members are on call, especially at odd hours when they can't simply ping colleagues with questions, you don't want them to feel lost or unsupported. It's therefore critical to establish roles and expectations early on and to keep documentation accessible to the whole team.
For instance, does being on call simply mean reacting to alerts as they arise, or does it involve some form of proactive monitoring? What should team members do if they are unable to resolve an issue right away? Runbooks and established protocols give everyone access to the knowledge they need to handle problems swiftly and efficiently, while also reducing the midnight panic.
Working night shifts can be taxing and demanding, which can harm an employee's health and wellbeing. Avoid scheduling overnight shifts as much as possible, unless staff actively want them. If avoiding night shifts is simply not possible, set up explicit guidelines and procedures for when people may be contacted after hours.
Offer flexible morning hours to anyone covering a night shift so they don't feel like they are working nonstop. Encourage employees to speak up if they believe on-call rotations are affecting their ability to rest and sleep.
Balancing on-call duties with day-to-day work can leave on-call team members burned out and exhausted. Providing extra support and flexibility is essential and greatly improves happiness and wellbeing. Lighten your team members' to-do lists by designing workflows that depend less on them during the day while they are on call.
Other team members may need to pick up part of the workload so that it is shared fairly between those who are on call and the rest of the team. Give teams as much leeway as possible, and encourage them to work together on solutions that reduce bottlenecks all around.
A lot of on-call work can be automated so that team members don't spend their time on low-value tasks. For common or minor issues, for instance, some monitoring and alert responses can be automated.
To minimise the steps required to handle an incident while on call, you can use incident response tools to build automated workflows, such as runbook automation. Team members perform better on call when they aren't constantly notified about problems that don't need their attention.
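As a minimal sketch of runbook automation, the example below maps alert types to automated steps and only notifies a person when no step exists or the step fails. It is not tied to any particular product: the alert types, the two placeholder steps, and the `notify` callback are all hypothetical.

```python
def restart_service(alert):
    # Placeholder: a real step would call your orchestration tooling.
    return f"restarted {alert['service']}"

def clear_temp_files(alert):
    # Placeholder: a real step would run a cleanup job on the host.
    return f"cleared temp files on {alert['host']}"

# Hypothetical mapping from alert type to an automated runbook step.
RUNBOOKS = {
    "service_down": restart_service,
    "disk_almost_full": clear_temp_files,
}

def handle(alert, notify):
    """Run the matching runbook step; escalate if none exists or it fails."""
    step = RUNBOOKS.get(alert["type"])
    if step is None:
        notify(f"no runbook for {alert['type']}, paging on-call")
        return "escalated"
    try:
        return step(alert)
    except Exception as exc:
        notify(f"runbook for {alert['type']} failed: {exc!r}")
        return "escalated"

pages = []
print(handle({"type": "service_down", "service": "api"}, pages.append))
print(handle({"type": "unknown_alert"}, pages.append))
print(pages)
```

Keeping escalation as the fallback means automation can only reduce the number of pages, never silently swallow an incident it cannot handle.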
Don't leave it all to a single function.
Certain on-call schedules place an undue burden on operations engineers or particular teams, which may cause problems in the future.
It's imperative to address imbalances in on-call rotations, because one individual or a small team cannot realistically manage a large system without running into difficulties. Adding an SRE team to the on-call roster, or introducing incident response tools, can take on part of the workload.
Being on call doesn't have to be stressful if you have the right resources. Pagerly makes on-call rotations easier by giving teams the tools they need for collaboration, monitoring, and notifications. It automates incident response by centralising context, assigning tasks, and storing event data in one place.
Pagerly maintains checklists, runbooks, and more, helping teams focus on what matters and automate the rest for a smoother incident response process overall. Every detail, including the actions taken, is recorded and preserved so that teams have what they need to keep improving their workflows and processes.
Capture data in real time and flag critical actions for retrospectives, so the team can keep collaborating without placing blame. Book a demo to find out more about how Pagerly improves the on-call experience.