Effective Slack on-call protocols for engineers

Falit Jain
July 25, 2024
5 min read

Conversations about being on call are usually met with complaints. Here's how to change that narrative and build a stronger, more compassionate process.

A few years ago, I took over responsibility for a significant portion of our infrastructure. It was a complex undertaking that, if not managed properly, could have resulted in major disruptions and wide-reaching financial consequences. For these kinds of projects, you need well-equipped engineers who understand the dependencies and systems, as well as a robust team that is available for emergencies during off-peak hours.

What I found was a well-intentioned team that was unprepared for the erratic nature of on-call work. My goal was to transform the on-call experience from a constant source of anxiety into a model of stability.

Why being on call causes anxiety: Being on call is often one of the most stressful aspects of a developer's job and a significant socio-technical challenge. That stress stems from a few basic problems that can make on-call work particularly difficult.

Unpredictable work schedule: On-call engineers must be available to respond to emergencies after hours. This intrusion into personal time leads to an unstable work-life balance.

High stakes and high pressure: Being on call means managing failures that could significantly affect business operations. The stakes create a lot of tension, and handling challenging issues alone after hours adds a sense of isolation.

Lack of preparation: If engineers do not receive the required training, preparation, and experience, they will feel unprepared to handle whatever issues arise. This exacerbates the problem by instilling anxiety and a dread of making the wrong decision.

Alert fatigue: When non-critical alerts arrive frequently, engineers become desensitised to them, which makes it harder to recognise significant issues. The result can be increased stress, slower reaction times, missed critical alerts, and reduced system reliability. It also breeds general dissatisfaction with the work and hinders swift action when a real issue does occur.

Why is it vital to have on-call protocols?

Despite its challenges, being on call is essential for maintaining the stability and health of production systems. Someone needs to cover your services during off-peak hours.

Being close to production systems is essential for on-call engineers since it ensures:

  • Business continuity: Fast response times are crucial to minimise downtime and the impact on end users and business operations. They can make the difference between a minor disruption and a major outage.
  • Deep understanding of systems: You can completely appreciate the nuances of the production environment when you are fully integrated into the system. It's advantageous to foster a culture where team members place an emphasis on problem prevention and performance optimisation rather than just reacting to issues when they arise.
  • Improving soft skills: On-call engineers must develop a variety of skills, including crisis management, quick decision-making, and effective communication. These skills are valuable for career progression across a wide range of work environments.

Adjusting to a Fearless On-Call Hierarchy

When my team first settled into the on-call system, we found a disorganized collection of makeshift fixes and a glaring lack of confidence in the deployment process. Despite their resilience, the engineers worked in isolation, making every night on call a lonely (mis)adventure. It also became clear that, without more preparation and support, a significant outage was imminent given our system's vulnerability. We had a ticking time bomb on our hands.

Our transformation started with simple, fundamental first steps.

Creating a pre-on-call checklist

A pre-on-call checklist is an easy yet efficient way to ensure engineers have completed all necessary activities before starting their shift. It prevents errors, reduces the likelihood of being unprepared, and promotes a proactive approach to incident management. The checklist we developed consisted of the following categories, each with specific assignments:

  • Squad-Level Training: Training tailored to the squad's on-call responsibilities. This includes role-specific assignments, typical on-call procedures, familiarization with necessary tools and architectures, and simulation exercises.
  • Records of Onboarding: Maintain accurate documentation so that everyone understands their roles and duties during the on-call process. This should include comprehensive explanations of every role, procedures for handling incidents, escalation paths, and crucial contacts. It's essential to ensure this documentation is readily available and regularly updated.
  • Protocols for Handling Incidents: Guidelines for monitoring and responding to incidents should be comprehensive, including organizational policies, training curricula, and escalation procedures.
  • Plan of Action for On-Call: Clearly describe the on-call rotation, including who is available and when. Additionally, ensure that all scheduled shifts are added to each engineer's calendar using the appropriate scheduling tools. This keeps everyone informed and prepared (a minimal rotation sketch follows this checklist).
  • Resources and Availability for Emergencies: Ensure that engineers have access to all equipment and privileges required for on-call duties. This may include monitoring dashboards, AWS queues, and OpsGenie.
  • Post-Mortem Process: Establish or refine a procedure for incident analysis and learning. This process should include post-mortem sessions where incidents are thoroughly investigated to identify root causes and opportunities for improvement. A blameless mindset is crucial, with the focus on learning and development rather than assigning blame.
  • Emergency Contacts: Make sure that each person on the on-call team has access to a list of emergency contacts. It's essential that the team knows how and when to contact these services.
  • Frequent Assessment: Attend regular review sessions for your squad or organization to assess operations, stability, and the on-call process. These discussions should take place regularly (e.g., every two weeks, every month) to ensure continuous improvement. Regular reviews promote effective operations and help enhance the on-call protocol.
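
As a concrete illustration of the on-call plan item above, here is a minimal sketch of how a round-robin rotation could be generated for review before shifts are pushed to calendars or a scheduling tool. The roster, shift length, and start date are hypothetical.

```python
# Round-robin on-call rotation sketch (hypothetical roster and dates).
from datetime import date, timedelta
from itertools import cycle

ENGINEERS = ["Asha", "Ben", "Chen", "Dana"]   # hypothetical roster
SHIFT_LENGTH = timedelta(days=7)              # one-week shifts

def build_rotation(start: date, shifts: int) -> list[tuple[date, date, str]]:
    """Return (shift_start, shift_end, engineer) tuples for the next `shifts` shifts."""
    rotation = []
    shift_start = start
    for engineer in cycle(ENGINEERS):
        if len(rotation) == shifts:
            break
        shift_end = shift_start + SHIFT_LENGTH
        rotation.append((shift_start, shift_end, engineer))
        shift_start = shift_end
    return rotation

if __name__ == "__main__":
    for shift_start, shift_end, engineer in build_rotation(date(2024, 8, 5), shifts=8):
        print(f"{shift_start} -> {shift_end}: {engineer}")
```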

After the engineers felt at ease using the pre-on-call checklist, we added them to the rotation. We warned them that while they understood the challenges in theory, handling a real crisis could be more complex and demanding.

Lastly, ensure that employees are accurately included in any payment systems your organization may have for on-call engineers. Pagerly is an excellent resource for exploring various on-call payment options.

Wheel of misfortune: Get comfortable with role-playing games

We introduced the "Wheel of Misfortune," a role-playing game inspired by Google's Site Reliability Engineering (SRE) practice. The idea is simple: we simulate service disruptions to assess and improve on-call engineers' response times in a safe environment. These drills help teams become better prepared to deal with real-world incidents.

In a 2019 post, systems engineer Jesús Climent of Google Cloud wrote, "If you've ever played a role-playing game, you presumably already know how it works: A scenario is run by a leader, such as the Dungeon Master, or DM, in which certain non-player characters encounter the players, who are the ones playing, in a predicament (in our example, a production emergency) and engage with them."

It's a useful method to ensure that engineers can confidently and skilfully handle significant but unusual events. Here are the steps we took to enhance on-call readiness:

Step 1: Verification of data

The first step was making sure we had the right data for helpful dashboards and alerts. This matters because relevant, reliable data is essential for effective monitoring and incident response. We thoroughly questioned our initial assumptions about what makes "good" dashboards and alerts. Verifying the quality and relevance of data helps ensure that alerts are reliable indicators of actual issues and avoids false positives.

An efficient on-call system depends on its ability to record and analyze important business KPIs or system performance data from incidents. We ensured that each service had the proper metrics, routinely logged them, and maintained a dashboard for quick overviews.
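
To make this concrete, here is a minimal sketch of how a service might expose these kinds of metrics using the Python prometheus_client library. The metric names, simulated workload, and port are hypothetical; in practice you would instrument real request handlers and point your monitoring system at the exposed endpoint.

```python
# Metrics instrumentation sketch (hypothetical metric names and workload).
# Requires: pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("orders_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    """Simulate handling one request and record latency and outcome."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.1))          # stand-in for real work
        status = "error" if random.random() < 0.02 else "ok"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(9100)   # exposes /metrics for the monitoring system to scrape
    while True:
        handle_request()
```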

Step 2: Enhancing notifications and dashboards

As alert fatigue is a common issue with on-call systems, we revamped our monitoring protocols to address it:

  • Review of Alerts: Regular evaluations of false positives were implemented to reduce unnecessary noise and adjust alert sensitivity. The main objective was to prevent desensitisation that can arise from repeatedly receiving irrelevant information.
  • Examining the Dashboard: We began ranking dashboards to determine which ones have the best widgets, access, and overall visibility. A few key questions we considered were: Is the dashboard readable when it needs to be? Is it actionable? Is it being used? In a recent blog article, Adrian Howard provided a great set of questions to further filter dashboards systematically.
  • Using Golden Signals: We used our golden signals dashboard to track traffic, error rates, and other business metrics. The four golden signals are latency, traffic, errors, and saturation: latency measures response times, traffic measures demand, errors track failed requests, and saturation records resource utilization. This makes it easier for on-call engineers to assess the health of the system and understand the implications of any changes or issues.
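
As a rough illustration of how the golden signals can be turned into actionable checks, the sketch below evaluates a snapshot of the four signals against example thresholds. The values and thresholds are hypothetical; in practice they would come from your monitoring system and be tuned per service.

```python
# Golden-signals evaluation sketch (hypothetical thresholds and sample values).
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float   # latency: how long requests take
    traffic_rps: float      # traffic: demand on the system
    error_rate: float       # errors: fraction of failed requests
    saturation: float       # saturation: fraction of capacity in use

THRESHOLDS = {"latency_p99_ms": 500.0, "error_rate": 0.01, "saturation": 0.80}

def evaluate(signals: GoldenSignals) -> list[str]:
    """Return human-readable warnings for any signal past its threshold."""
    warnings = []
    if signals.latency_p99_ms > THRESHOLDS["latency_p99_ms"]:
        warnings.append(f"p99 latency high: {signals.latency_p99_ms:.0f} ms")
    if signals.error_rate > THRESHOLDS["error_rate"]:
        warnings.append(f"error rate high: {signals.error_rate:.2%}")
    if signals.saturation > THRESHOLDS["saturation"]:
        warnings.append(f"saturation high: {signals.saturation:.0%}")
    return warnings

if __name__ == "__main__":
    snapshot = GoldenSignals(latency_p99_ms=620, traffic_rps=150, error_rate=0.004, saturation=0.55)
    for line in evaluate(snapshot) or ["all golden signals within thresholds"]:
        print(line)
```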

Step 3: Increasing awareness

After the team developed a thorough understanding of our deployments, flow, and health data, we began searching for further areas to optimize.

  • Recognizing and Monitoring Your Dependencies: Understanding upstream and downstream dependencies is essential for incident response. We created architecture diagrams to make the information flow easier for everyone to understand. These diagrams showed how different components interact, for example services A, B, and C, where service A depends on service B and service B depends on service C. We then built dashboards to monitor these dependencies, allowing us to quickly identify and address any unusual activity. This made monitoring interdependencies within the system much simpler (a small health-check sketch of this idea follows this list).
  • Communication Regarding Dependencies: In the event of an emergency, we ensured that each engineer knew how to contact the teams and services they relied on.
  • Runbooks: We improved our runbooks by making them more accessible. Each runbook included sufficient details and example scenarios to enable experienced on-call engineers to start resolving problems immediately.
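
To sketch the dependency-tracking idea above, the snippet below walks a small, hypothetical dependency map (service A depends on B, which depends on C) and flags any service whose health endpoint is unreachable. The service names, URLs, and health-check behaviour are assumptions for illustration only.

```python
# Dependency health-check sketch (hypothetical services and URLs).
import urllib.error
import urllib.request

# Upstream/downstream map: service -> services it depends on.
DEPENDENCIES = {
    "service-a": ["service-b"],
    "service-b": ["service-c"],
    "service-c": [],
}

# Hypothetical health endpoints for each service.
HEALTH_URLS = {
    "service-a": "http://service-a.internal/healthz",
    "service-b": "http://service-b.internal/healthz",
    "service-c": "http://service-c.internal/healthz",
}

def is_healthy(service: str, timeout: float = 2.0) -> bool:
    """Return True if the service's health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URLS[service], timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def check_dependency_chain(service: str) -> list[str]:
    """Return any unhealthy services in `service`'s dependency chain."""
    unhealthy, stack, seen = [], [service], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if not is_healthy(current):
            unhealthy.append(current)
        stack.extend(DEPENDENCIES.get(current, []))
    return unhealthy

if __name__ == "__main__":
    print(check_dependency_chain("service-a") or "all dependencies healthy")
```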

Step 4: Reducing Stress for Those Who Are On Call

We started optimizing the areas that needed more effort or had knowledge gaps as soon as things normalized.

  • Ritualisation of Handover: We turned the handover process into a formal procedure. Before the on-call shift begins, the departing engineer conducts an official review of the system state, current problems, recent events, and any risks. This ensures that the new engineer is well-prepared for their shift.
  • Acknowledging On-Call Successes: We began recognizing on-call successes. During each post-on-call evaluation, we acknowledged the challenges faced and the accomplishments made. This could be anything from handling a serious incident to proactively resolving a problem before it escalates.

Step 5: Continuous Improvement and Knowledge Sharing

  • Regular Post-Event Assessments: We carried out objective, regular post-incident reviews to improve and learn from every experience. Every session with the involved on-call team included a root-cause analysis focused on understanding issues rather than assigning blame.
  • Monthly Operations Review Meetings: To keep our on-call protocols current and improved, we host monthly operational review meetings. The agenda items for these meetings include:
    • Reviewing Previous Action Items: Evaluate the follow-up actions from the last meeting.
    • On-Call Rotation Status: Assess how well the current on-call mechanism is working, including evaluating the group's performance, ensuring all shifts are covered, estimating the workload, and adding new members to the rotation.
    • Current Affairs: Discuss any events that have occurred since the last meeting.
    • Alert Evaluations: Review the Mean Time to Recovery (MTTR) reports (a minimal MTTR calculation sketch follows this list).
    • Lessons Learned: Share insights from alerts or recent incidents.
    • Pain Points: Discuss the challenges the group is facing and devise solutions to address these issues.
  • Knowledge Base and Documentation: Our knowledge base contains runbooks, playbooks, and documentation. With regular updates and integration with our monitoring systems, it provides incident-specific guidance to on-call personnel.
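
Since MTTR comes up in every review, here is a minimal sketch of the calculation: the average time from the alert firing to service recovery across the incidents in the review period. The incident timestamps are hypothetical.

```python
# MTTR (Mean Time to Recovery) calculation sketch (hypothetical incident data).
from datetime import datetime, timedelta

# Each incident: (time the alert fired, time the service was restored).
INCIDENTS = [
    (datetime(2024, 7, 2, 14, 5), datetime(2024, 7, 2, 14, 47)),
    (datetime(2024, 7, 9, 2, 30), datetime(2024, 7, 9, 4, 10)),
    (datetime(2024, 7, 18, 9, 12), datetime(2024, 7, 18, 9, 31)),
]

def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration between alert and recovery across all incidents."""
    total = sum((recovered - alerted for alerted, recovered in incidents), timedelta())
    return total / len(incidents)

if __name__ == "__main__":
    print(f"MTTR for the period: {mean_time_to_recovery(INCIDENTS)}")
```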

What happens after that?

As soon as the firefighting stops and you have some breathing room, you should start investing aggressively in SLOs (Service Level Objectives) and SLIs (Service Level Indicators). They are the only way your team can begin behaving proactively and stop reacting to everything that happens. SLOs, such as 99.9% uptime, define the target level of service reliability; SLIs are the metrics, such as measured uptime percentage, that show how well you're meeting those objectives.

By creating and adhering to SLOs and SLIs, your team can stop constantly patching problems and focus on achieving predefined performance goals.
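
To make the arithmetic concrete, here is a small sketch of a 99.9% availability SLO and its error budget, i.e. the amount of downtime you can still afford in the measurement window. The target and window are illustrative.

```python
# SLO / error-budget arithmetic sketch (illustrative target and window).
from datetime import timedelta

SLO_TARGET = 0.999            # 99.9% availability objective
WINDOW = timedelta(days=30)   # rolling 30-day window

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total downtime allowed in the window while still meeting the SLO."""
    return window * (1 - slo)

def budget_remaining(downtime_so_far: timedelta) -> timedelta:
    """Error budget left after the downtime already incurred this window."""
    return error_budget(SLO_TARGET, WINDOW) - downtime_so_far

if __name__ == "__main__":
    budget = error_budget(SLO_TARGET, WINDOW)   # roughly 43 minutes for 30 days
    print(f"Error budget for 30 days at 99.9%: {budget}")
    print(f"Remaining after 20 minutes of downtime: {budget_remaining(timedelta(minutes=20))}")
```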

Final thoughts

The engineering team has the responsibility to be available and accountable for their code, while management is obligated to ensure that on-call work is done effectively. There is reciprocity in this relationship.

Engineering executives can showcase successful on-call strategies by examining how they organize and prioritize their systems. Your company’s previous successes and specific needs will determine your on-call strategy. When adjusting your plan to meet the needs of your team, consider your own requirements and what you can reasonably do without.

By being proactive, you can maintain stability while also raising the bar for operational resilience. However, this approach requires ongoing maintenance, care, and continuous improvement. The good news is that a practical solution is within reach.
