For many engineering and operations teams, being available when needed is essential to maintaining the reliability and availability of their services, and a central part of that responsibility is helping the organisation meet its various SLAs. This article discusses the key principles of on-call work, along with real-world examples from industry of how to plan and run on-call for a globally distributed team of site reliability engineers (SREs).
Having someone on call means production incidents can be responded to promptly, preventing breaches of service level agreements (SLAs) that could have a major negative impact on the organisation. An SRE team usually dedicates at least 25% of its time to being on call; for instance, an engineer might be on call for one week out of every month.
SRE teams have historically approached on-call work as more than merely an engineering challenge. Some of the most difficult aspects are scheduling, managing the workload, and keeping up with technological change. Any organisation must also instil a site reliability engineering culture.
The following are the essential components that SRE teams must take into account in order to successfully handle on-call shifts.
Traditionally, the purpose of SRE teams has been to maintain complex distributed software systems, which may be deployed across a number of data centres worldwide.
SRE teams are typically dispersed across different regions in order to benefit from the different time zones. The org chart below illustrates one possible configuration for the SRE team: regional SRE managers report to a global SRE director, and the SREs, who handle the majority of the on-call duties, report to their respective regional managers.
The "follow-the-sun" approach can be used to plan on-call shifts if the team is large enough. Let's take an example where the corporation is based in Chicago and operates on US Central Time (CT). The SRE team in the US might be available from 10 a.m. to 4 p.m. CST if they were a distributed team. The Sydney crew then goes on call from 8 a.m. to 2 p.m. Sydney time (4 p.m. to 10 p.m. CST) because by then the Australian working day had begun. After that, the Singapore squad would rotate in third place, taking on on-call duty from 11 a.m. to 5 p.m. Singapore time (10 p.m. to 4 a.m. CST). The London team takes up the on-call duties from the Singapore team at 9 am London time and continues until 3 pm London time (4 am to 10 am CST), at which point it turns the shift back over to the US team at 10 am CST. This completes the London team's shift cycle.
A "handover" procedure is required at the conclusion of every shift, during which the team taking over on-call responsibilities is informed about on-call matters as well as other pertinent matters. For a total of five consecutive working days, this cycle is repeated. If traffic is less than it is during the week, SRE on-call staff could work one shift per weekend. An extra day off the following week or, if they would choose, cash payment should be given to this person as compensation.
While most SRE teams have the structure described above, some are set up differently, with very small satellite teams supplementing a disproportionately large centralised office. If on-call duties are distributed across regions in that case, the small satellite teams can feel overburdened and excluded, which eventually leads to demoralisation and high turnover. Such organisations instead treat the cost of handling on-call issues outside regular business hours as justified by keeping complete ownership of on-call responsibilities in a single location.
If the company does not have a multi-region team, the on-call schedule can be created by dividing the year into quarters (January to March, April to June, July to September, and October to December), assigning each quarter to a group within the team, and rotating the night-shift burden every three months.
A timetable like this supports the human sleep cycle: a well-structured rotation that cycles every three months is far less strenuous than being on call a few nights per week, which constantly disrupts people's schedules.
Managing the personal time off (PTO) plan is essential, since the availability and reliability of the platform as a whole depends on the SRE role. On-call support needs to take precedence over development work, and the team should be large enough that those who are not on call can cover for absentees.
Assume that five SREs report to the regional SRE manager in North America (SRE MGR NA). During the week, two SREs will be on call in accordance with the regular schedule outlined in the section above, while the remaining three work on development. In an emergency, a development SRE will take over for an on-call SRE; the dev SRE should receive financial compensation for the extra on-call work, or their rotation in the next shift should be adjusted.
There are local holidays specific to each geographic location, such as Thanksgiving in the USA and Diwali in India, and SRE teams around the globe should be allowed to swap shifts at these times. Holidays observed worldwide, such as New Year's Day, should be treated like weekends, with minimal staffing and a relaxed pager response time.
Every shift begins with a handover briefing from the previous shift, covering major events, observations, and any outstanding problems that still need to be fixed. The SRE then opens the command-line terminal, monitoring consoles, dashboards, and ticket queue in preparation for the on-call session.
The SRE and development teams identify service level indicators (SLIs) based on metrics and create alerts on them. Event-based monitoring systems can also be set up to send alerts based on events in addition to metrics. Consider the following scenario: the engineering team and the SREs (during their off-call development time) decide to use the metric cassandra_threadpools_activetasks as an SLI to track the performance of the production Cassandra cluster. In this case, the SRE can configure the alert in a Prometheus alerting-rules YAML file, and the annotations highlighted in the sample below can be used to publish the alert. These annotations are one way to interface with modern incident response management systems.
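The rule below is a minimal sketch of what such a configuration might look like, written in standard Prometheus alerting-rules YAML; the alert name, threshold, and annotation values are illustrative assumptions, and the annotations marked in the comments are the ones an incident response integration would typically read.

```yaml
groups:
  - name: cassandra-slis
    rules:
      - alert: CassandraHighActiveTasks           # illustrative alert name
        # Fire when the active task count stays above the threshold for 10 minutes.
        expr: cassandra_threadpools_activetasks > 100
        for: 10m
        labels:
          severity: critical
          team: sre
        annotations:
          # These annotations travel with the alert; an incident response platform
          # integration typically reads the summary/description fields.
          summary: "Cassandra thread pool active tasks above threshold"
          description: "Active tasks on {{ $labels.instance }} have exceeded 100 for 10 minutes."
          runbook_url: "https://wiki.example.com/runbooks/cassandra-active-tasks"  # hypothetical URL
```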
When the alert condition is met, the Prometheus Alertmanager forwards the alert to the incident response management platform. The on-call SRE then needs to examine the active task count on the dashboard closely and determine the cause of the excessive task count by digging into the metrics. Once the cause has been identified, corrective action must be taken.
Another illustration is event-based monitoring, where a ping-failure alert is generated if, for instance, a host or cluster cannot be reached. The SRE should investigate and, if it turns out to be a network problem, escalate the issue to the networking team. This procedure can be combined with a modern incident response management platform to identify problems and apply runbooks for corrective action.
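As a quick first check after such an alert, the on-call SRE might confirm reachability from a terminal before escalating. The sketch below is only illustrative; the host names and the service port are assumptions for this example.

```bash
# Confirm whether the hosts behind a ping-failure alert are reachable.
# Host names and the service port (9042, the Cassandra native port) are assumptions.
HOSTS="db-01.example.com db-02.example.com"
for host in $HOSTS; do
  if ping -c 3 -W 2 "$host" > /dev/null 2>&1; then
    echo "$host: ICMP reachable"
  else
    echo "$host: ICMP unreachable, checking TCP port 9042"
    nc -z -w 3 "$host" 9042 && echo "$host: port 9042 open" || echo "$host: port 9042 closed or filtered"
  fi
done
```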
It is recommended that these alerting systems be integrated globally with a ticketing system, such as Atlassian's Jira. Every alert should automatically generate a ticket, and the SRE and any other stakeholders who responded to the alert must update the ticket once it has been handled.
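Where the alerting pipeline cannot file tickets directly, a small webhook handler can create them through Jira's REST API. The call below is only a sketch: the site URL, project key, and credentials are assumptions, and many teams would use a native Alertmanager-to-Jira integration instead.

```bash
# Create a Jira ticket for a firing alert via the REST API (illustrative values).
curl -s -u "oncall-bot@example.com:${JIRA_API_TOKEN}" \
  -X POST "https://example.atlassian.net/rest/api/2/issue" \
  -H "Content-Type: application/json" \
  -d '{
        "fields": {
          "project":   { "key": "SRE" },
          "summary":   "CassandraHighActiveTasks firing on the production cluster",
          "description": "Auto-created from the Prometheus Alertmanager webhook.",
          "issuetype": { "name": "Task" }
        }
      }'
```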
At the start of the on-call period, the SRE should have a terminal console available with SSH or any other CLI tools provided by the organisation. The engineering or customer service staff may contact the on-call SRE for help with a technical problem. Assume, for instance, that a user action on the platform, such as hitting the cart's checkout button, produces a unique request ID. The request passes through the distributed system's various components, such as the database service, load-balancing service, and compute service. When the SRE receives a request ID, they are expected to report on its life cycle, for example which machines and components logged the request.
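Such a lookup often starts with nothing more sophisticated than searching each component's logs over SSH, as in the sketch below; the host names, log path, and request ID are hypothetical.

```bash
# Trace a request ID across the logs of several services (illustrative hosts and paths).
REQUEST_ID="req-8f41c2"   # hypothetical request ID supplied by customer service
for host in lb-01 api-01 db-01; do
  echo "--- $host ---"
  ssh "$host" "grep -r \"$REQUEST_ID\" /var/log/app/ 2>/dev/null | head -n 20"
done
```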
If the problem isn't immediately evident, for example if the request ID described above wasn't logged by any service, an SRE might need to look into network problems. To rule out network-related issues, the SRE may capture packets with tcpdump or Wireshark, two open-source packet analysis programmes. This is laborious work, and the SRE may enlist the network team's help in examining the packets.
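A capture handed to the network team might look something like the following; the interface, port filter, and output path are assumptions for this example.

```bash
# Capture up to 5,000 packets of service traffic on eth0 for later analysis.
# Interface, port, and output file are assumptions; adjust to the environment.
sudo tcpdump -i eth0 -nn -s 0 -c 5000 port 9042 -w /tmp/service-traffic.pcap
# The resulting .pcap file can then be opened in Wireshark or shared with the network team.
```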
When troubleshooting such challenges, the cumulative knowledge gathered from the varied backgrounds of the SREs on a team is invaluable. All of these actions should be documented and later used to train new SREs as part of the onboarding process.
The on-call SRE should own the deployment procedure during business hours. If something goes wrong, the SRE must be able to resolve the problem and roll the change back or forward. The SRE should only perform emergency deployments that affect production, and should have sufficient knowledge to help the development team prevent any negative effects on production. Deployment procedures should be thoroughly documented and closely integrated with the change management process.
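Assuming a Kubernetes-based deployment (Kubernetes is only one example platform, and the deployment name and namespace here are hypothetical), a roll-back might look like this:

```bash
# Verify the rollout; if it fails or stalls, roll back to the previous revision.
kubectl -n checkout rollout status deployment/cart-service --timeout=120s \
  || kubectl -n checkout rollout undo deployment/cart-service
# Confirm which revision is now serving traffic.
kubectl -n checkout rollout history deployment/cart-service
```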
SREs keep a close eye on the tickets routed through their queues that are awaiting action. These tickets, generated by the alerting software or escalated by other teams, have a lower priority than those created by ongoing production issues.
It is standard procedure for every ticket to include a statement of the SRE actions that have been completed. If the SRE cannot act on a ticket, it must be escalated to the appropriate team. When one SRE team hands over to another, the queue should ideally be empty.
On-call SREs are most affected by undetected issues. If the monitoring and alerting software fails to notify them, or a problem goes unnoticed until a customer reports it, the on-call SRE must use all of their skills to resolve the issue swiftly, enlisting the relevant development teams to help.
After the problem is fixed, a ticket with comprehensive incident documentation, a summary of all the actions taken to fix the problem, and an explanation of what could be automated and alerted on should be shared with all relevant stakeholders. This ticket should be prioritised above all other development work.
Handover protocols
During every transition of SRE responsibilities, certain conventions must be observed to enable a seamless handover of on-call duties to the next person. One such convention is giving the incoming on-call engineer a summary packet. Every ticket from the shift must be updated and documented with the steps taken to resolve it, any additional comments or questions from the SRE staff, and any clarifications from the development team. These tickets should be categorised according to how they affect the reliability and availability of the platform, and tickets that affect production should be flagged, particularly for the post-mortem. Once compiled and classified by the outgoing on-call SRE, this list of tickets should be posted to a shared communication channel for handover. The team may use a dedicated Slack channel, Microsoft Teams, or the collaboration features found in incident response platforms like Pagerly. The summary should be accessible to the entire SRE organisation.
All engineering leads and SREs attend the weekly post-mortem meeting. The format of post-mortem sessions is outlined in the flowchart that follows.
The most crucial outcome of the post-mortem is making sure the problems highlighted don't happen again. This is not an easy task: corrective actions can be as simple as writing a script or adding more code checks, or as involved as redesigning the entire application. SRE and development teams should collaborate closely to get as near as possible to the goal of preventing recurring issues.
Every time a problem arises, a ticket with the necessary action items, documentation, and feature requests needs to be filed. It is frequently unclear at first which team should handle a specific ticket; Pagerly's routing and tagging features make it possible to automate this step and overcome that obstacle. A ticket may begin as a customer service ticket and, depending on the type and severity of the incident, be escalated to the engineering or SRE teams. Once the relevant team has addressed it, the ticket is returned to customer service; resolution may involve communicating with the customer or simply documenting and closing the issue. The ticket will also be part of the handover procedure and reviewed in the post-mortem for further analysis. To lessen alert fatigue, it is good practice to classify the issue accurately and route it to the most appropriate expertise.
Members of on-call engineering teams are paged often. To minimise the frequency of paging, pages must be carefully targeted to the right teams. When a problem on the platform warrants a page, the customer service team is typically the first point of contact; from there, as covered in the previous section, the usual investigation and escalation procedures follow.
Every alert needs a documented plan and resolution procedure. When a customer reports a problem, for instance, the SRE should check whether any alerts are currently firing and determine whether they are related in any way to the customer's problem.
The aim should be to restrict pages to the SRE to genuinely actionable issues. When it is impossible to decide which team to page, a central communication channel must be established so that SRE, engineering, customer service, and other team members can join, discuss, and address the problem.
A runbook is a summary of the precise actions, including commands, that an SRE has to take to address a specific incident, as opposed to a cheat sheet that lists every command for a certain platform (like Kubernetes or Elasticsearch). When resolving incidents, time is of the essence: an on-call engineer who is well versed in the subject and has a list of instructions and action items at hand can implement solutions much more rapidly.
A team may administer a single runbook centrally, or each engineer may prefer to maintain their own. Here's an example runbook.
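The entry below is an illustrative sketch, continuing the Cassandra scenario used earlier in this article; the host names and exact remediation steps are assumptions and should be adapted to the environment.

```bash
# Runbook entry: "Cassandra thread pool active tasks above threshold" (illustrative).

# 1. Confirm the alert is still firing and identify the affected node(s)
#    from the monitoring dashboard before touching the cluster.

# 2. Check node-level health on the affected host.
ssh cassandra-prod-03 "nodetool status"            # look for nodes marked DN (down)
ssh cassandra-prod-03 "nodetool tpstats"           # inspect pending/active thread pool tasks

# 3. Check for long-running compactions that may be starving the thread pools.
ssh cassandra-prod-03 "nodetool compactionstats"

# 4. If a single node is overloaded, drain it so traffic shifts to healthy
#    replicas, then restart the Cassandra service.
ssh cassandra-prod-03 "nodetool drain && sudo systemctl restart cassandra"

# 5. If the whole cluster is saturated, escalate to the database engineering
#    team and attach the tpstats output to the incident ticket.
```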
One could argue that many of the tasks in a runbook could be automated, carried out with Jenkins, or built as separate continuous integration jobs. That is the ideal scenario, but not every SRE team is mature enough to carry it out flawlessly. A good runbook is also a helpful guide for automating tasks later. An engineer's need for the command line is constant, so anything that must be regularly typed into a terminal is a perfect candidate for a runbook.
Any technical team wishing to ship new features must obtain SRE approval beforehand. SRE approval is required for anything that enters production, and dependencies, load testing, capacity planning, and disaster recovery should all be covered in detail.
Development tickets required for the SRE's monitoring and alerting should be created so that the platform feature and its associated alerting go into production at the same time. Ideally, changes should initially be rolled out in a canary or A/B fashion. Changes should be planned, implemented, scheduled, and managed using a platform accessible to the entire organisation.
A change request (CR) created by an engineering team member is forwarded to upper management and the SRE team for approval. When creating a CR, the author must include all relevant tickets and documentation regarding the potential effects of the change on the production system. After receiving approval from the SRE team and management, the CR is scheduled for execution at the time requested by its opener.
When the CR is implemented, functionality and platform checkouts must be carried out to guarantee the platform's stability. Should something break, the SRE should roll the change back right away. Following a successful execution, the CR should be closed and documented on the organisation-wide change management platform.
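A checkout can be as simple as probing the affected health endpoints, as in this sketch (the endpoint URLs are hypothetical):

```bash
# Probe health endpoints after a change; any non-200 response suggests rolling back.
ENDPOINTS="https://api.example.com/healthz https://checkout.example.com/healthz"
for url in $ENDPOINTS; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url")
  echo "$url -> HTTP $code"
  [ "$code" = "200" ] || echo "WARNING: $url unhealthy; consider rolling back the change"
done
```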
Appropriate documentation for SRE tools and services is essential: it facilitates both the onboarding of new engineers and the upskilling of current employees. Maintaining and updating documentation takes a lot of work, but the effort pays off. A skilled SRE should be familiar with all the distributed software systems in use and have a basic understanding of the following:
Once these fundamentals have been grasped, it makes sense to move on to the core SRE services, such as development, automation, and monitoring. If the company uses any open-source monitoring systems, it is important to read the installation, customisation, and deployment instructions and take the necessary notes.
After gaining the necessary knowledge, it's a good idea to practise these newly developed skills in non-production internal environments. This also offers a chance to correct any errors in the documentation.
After gaining the necessary technical knowledge, the SRE trainee is permitted to shadow more experienced team members during on-call sessions. The new SRE shadows the primary on-call engineer to gain practical knowledge of the on-call procedure, and should be assigned increasing amounts of responsibility until they can handle on-call tasks on their own.
Creating an on-call rotation plan that works and is sustainable requires careful thought, including planning shift changes and scheduling, creating an escalation mechanism, and holding post-mortem analysis meetings after every incident. Pagerly is designed to plan, carry out, monitor, and automate duties related to standard site reliability engineering responsibilities, such as on-call rotations.