Managing alerts from monitoring tools well is crucial for efficient operations. It is also hard, because nothing confirms whether a subsequent alert points to the same underlying problem as an earlier one. The result is a repetitive, time-consuming process for an organization’s operations team (site reliability engineers, performance engineers, and others), who must manually analyze each alert and often discover duplicated issues. In response, organizations are prioritizing automation (66%) and enhancing productivity (61%), according to a recent survey. These figures underscore the daily hurdles operations teams face.
Consequently, organizations are increasingly adopting Artificial Intelligence for IT Operations (AIOps), which leverages AI to streamline operations and enhance network performance. At Salesforce, our DBAIOps (Database Artificial Intelligence for Operations) team has taken AIOps to the next level and revolutionized database operations by implementing the similarity model.
This model uses techniques such as cosine similarity and Jaccard similarity to score how similar two pieces of text are. By comparing the root causes of incidents and assigning each pair a similarity score, the model streamlines incident resolution and marks a significant transition in incident management.
This approach helps identify commonalities among incidents, preventing alert overload and facilitating more effective resolution processes. This ultimately improves operational efficiency and reduces the manual workload for the operations team.
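For reference, the two measures named above are commonly defined as follows. This is the textbook form only: the sets $A$, $B$ (for example, the keywords of two RCA texts) and the vectors $\mathbf{a}$, $\mathbf{b}$ (for example, their term counts) are illustrative assumptions, since the exact text representation DBAIOps uses is not described here.

```latex
% Jaccard similarity of two keyword sets A and B
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Cosine similarity of two term vectors a and b
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

Jaccard similarity always falls between 0 (nothing shared) and 1 (identical sets), and so does cosine similarity when the vectors hold non-negative term counts.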
Read on to discover how the similarity model helped DBAIOps overcome its four toughest technical challenges.
Challenge #1: Reducing Alerts and Manual Effort
Anomaly detection systems traditionally generate alerts for each abnormal pattern detected, failing to consider if subsequent anomalies are related to the same root cause as the initial anomaly.
DBAIOps faced a daily influx of alerts across multiple instances, many of which duplicated the same issue and required manual analysis. Identical performance problems in different instances, such as SQL-related issues, led to redundant alerts that each performance engineer had to verify by hand.
To address this, DBAIOps’ similarity model compares the root causes of alerts. By analyzing the Root Cause Analysis (RCA) of the current alert against those of previous alerts, the model determines whether the alert is a duplicate and suppresses it if so. Validations showed a 23% reduction in duplicate cases: the model identified investigations that shared a cause with an earlier one and suppressed them.
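The duplicate check described above can be sketched roughly as follows. This is a minimal illustration, not DBAIOps’ implementation: the function names, the whitespace tokenization, and the 0.8 threshold are all assumptions made for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tokens(rca_text: str) -> set:
    """Naive tokenization (assumption): lowercase and split on whitespace."""
    return set(rca_text.lower().split())


def is_duplicate(new_rca: str, previous_rcas: list, threshold: float = 0.8) -> bool:
    """Return True if the new RCA is close enough to any earlier RCA.

    The threshold is a hypothetical value chosen for illustration.
    """
    new_tokens = tokens(new_rca)
    return any(jaccard(new_tokens, tokens(p)) >= threshold for p in previous_rcas)


previous = ["high cpu caused by sql id abc123 full table scan"]
print(is_duplicate("high cpu caused by sql id abc123 full table scan on org1", previous))
print(is_duplicate("connection pool exhausted on app server", previous))
```

In this example the first alert shares 10 of 12 distinct tokens with the earlier RCA (a score of about 0.83) and would be suppressed, while the second shares none and would open a new investigation.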
This approach enhances incident management efficiency, reduces manual labor, and minimizes noise, allowing operational teams to focus on resolving actual issues.
Challenge #2: Using Historical Context to Solve New Cases
When multiple alerts arrive from different sources, it is crucial to determine whether they relate to the same issue. Traditional approaches cannot do this, which leads to duplicated effort and lower productivity: each alert is analyzed individually, with no knowledge of how it relates to the others.
To deal with this, DBAIOps’ similarity model automatically tags the current investigation with relevant past resolutions if a similar issue has occurred before. This Salesforce technology lets the team track large volumes of historical information, share knowledge, and reach past resolutions quickly, which improves the incident resolution process. Approximately 50% of proactive investigations were matched to similar past cases through this tagging.
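One way to picture the tagging step is a nearest-match lookup over past cases. The sketch below uses cosine similarity over simple term counts. The `best_past_match` function, the `(rca_text, resolution)` record shape, and the 0.5 cutoff are hypothetical choices for illustration, not details taken from the DBAIOps system.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_past_match(new_rca: str, past_cases: list, min_score: float = 0.5):
    """Return (resolution, score) for the most similar past case, or None.

    past_cases is a list of (rca_text, resolution) pairs (assumed shape).
    """
    new_vec = Counter(new_rca.lower().split())
    best = None
    for past_rca, resolution in past_cases:
        score = cosine(new_vec, Counter(past_rca.lower().split()))
        if score >= min_score and (best is None or score > best[1]):
            best = (resolution, score)
    return best


past = [
    ("slow sql full table scan on orders", "add index"),
    ("connection pool exhausted", "raise pool size"),
]
print(best_past_match("slow sql full table scan on accounts", past))
```

Returning `None` when nothing clears the cutoff keeps unrelated resolutions from being attached to an investigation.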
Challenge #3: Efficient Assignment Triage
Failing to assign an engineer with the necessary expertise can delay issue resolution. Previously, investigations were typically assigned to the instance owner by default, with potential reassignment to another engineer based on availability. However, this approach can overlook important factors such as past experience with similar issues.
To tackle this, DBAIOps’ similarity model analyzes historical data and incident patterns to intelligently assign new cases to experts who possess the specific expertise required. This automated triaging process ensures that the right engineer is assigned to each task, leading to faster issue resolution and improved overall productivity. The positive feedback received from the Performance Engineering team further validates the efficacy of our model in accurately triaging cases based on tagged instances, while also reducing the Mean Time To Assign (MTTA).
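As a rough sketch of how such triage could work, the example below picks the engineer whose previously resolved cases are most similar to the new one, and falls back to the instance owner otherwise. The input shape (pairs of engineer name and similarity score), the `pick_assignee` name, and the 0.5 cutoff are assumptions for illustration; the actual assignment logic is not described in this post.

```python
from collections import defaultdict


def pick_assignee(scored_matches: list, default_owner: str, min_score: float = 0.5) -> str:
    """Choose an assignee from similar past cases.

    scored_matches: (engineer, similarity_score) pairs, one per past case
    that resembles the new investigation (assumed to be computed upstream).
    Scores below min_score are ignored; with no usable match, the
    instance owner keeps the case.
    """
    totals = defaultdict(float)
    for engineer, score in scored_matches:
        if score >= min_score:
            totals[engineer] += score
    if not totals:
        return default_owner
    # Engineer with the highest accumulated similarity wins.
    return max(totals, key=totals.get)


matches = [("alice", 0.9), ("bob", 0.6), ("alice", 0.7), ("carol", 0.3)]
print(pick_assignee(matches, default_owner="instance-owner"))
```

Summing scores rather than taking the single best match favors engineers who have handled the same kind of problem repeatedly, which is one plausible reading of "specific expertise".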
Challenge #4: Increasing Severity Ranking
Frequent alerts on the same issue can indicate a potential customer incident waiting to happen. By default, proactive alerts are often assigned lower severity levels. However, this approach may not effectively handle recurring incidents.
To resolve this, DBAIOps’ similarity model intelligently ranks incident severity by detecting patterns in incidents with the same RCA. For example, by identifying frequently occurring alerts, the severity ranking can be automatically increased.
The instant update in severity ranking is crucial for identifying and prioritizing critical incidents, leading to a more efficient resolution process. Our implementation of this model has resulted in a 23% improvement in incident severity ranking, enabling quicker action when incidents occur repeatedly. In other words, if DBAIOps has 100 investigations in a month and 23 of them keep generating alerts until the main problem is resolved, the similarity model recognizes the pattern and recommends increasing the severity of those incidents.
By proactively addressing these high severity alerts, we can minimize the Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR), thus improving service reliability and availability.
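The escalation rule can be illustrated with a small sketch: count how many alerts with the same RCA arrived within a recent window, and recommend a higher severity once that count crosses a limit. The 24-hour window, the three-alert limit, and the convention that a lower number means higher severity (1 being the most severe) are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def recommend_severity(
    current: int,
    alert_times: list,
    now: datetime,
    window: timedelta = timedelta(hours=24),
    min_repeats: int = 3,
) -> int:
    """Recommend a severity for an incident based on alert recurrence.

    alert_times: timestamps of alerts that share this incident's RCA.
    Severity is an integer where 1 is the most severe (assumption).
    """
    recent = [t for t in alert_times if now - window <= t <= now]
    if len(recent) >= min_repeats:
        # Raise severity by one level, but never beyond level 1.
        return max(1, current - 1)
    return current


now = datetime(2024, 1, 2, 12, 0)
repeating = [now - timedelta(hours=h) for h in (1, 5, 20)]
print(recommend_severity(4, repeating, now))
```

Here three alerts fall inside the window, so a severity 4 incident would be recommended for severity 3.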
Diving deeper: Understanding how similarity scores are calculated
Understanding how DBAIOps calculates similarity scores is crucial for efficient incident resolution. Here’s a breakdown of the steps involved:
- New alert identified: Our detection runbook systematically gathers data from monitoring tools. If abnormalities are detected during the analysis time range, an alert is triggered. Once an alert is detected, DBAIOps triggers the RCA workflow. This workflow identifies an initial diagnosis, such as determining the type of SQL contributing to the alert or which org is contributing to the issue. This alert marks the start of our investigation and the incident resolution process.
- Data cleansing: The RCA text undergoes a cleansing process to refine it. This includes removing special characters and stopwords to streamline the analysis. Keyword extraction is also performed to enhance the computation of similarity scores.
- Alert comparison: The alert is compared with data stored in the Knowledge Repository, a comprehensive database capturing detailed information about alerts, RCAs, and historical insights. The RCA workflow triggers when an alert is detected, updating the Knowledge Repository with the latest data for accurate comparison. A similarity model generates meaningful scores for efficient alert comparison.
- Score generation: The purpose-built similarity model calculates scores that guide subsequent actions when comparing RCAs.
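The data cleansing step in the list above can be sketched as follows. This is an illustrative approximation only: the stopword list, the regular expression, and the frequency-based keyword extraction are assumptions, and the real pipeline may use different rules or libraries.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "was", "of", "on", "in", "to", "by", "and", "for", "with", "at"}
)


def cleanse(rca_text: str) -> list:
    """Lowercase, replace special characters with spaces, drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", rca_text.lower())
    return [word for word in text.split() if word not in STOPWORDS]


def extract_keywords(rca_text: str, top_n: int = 5) -> list:
    """Return the most frequent remaining words as keywords."""
    counts = Counter(cleanse(rca_text))
    return [word for word, _ in counts.most_common(top_n)]


rca = "High CPU caused by SQL: full-table scan on the ORDERS table!"
print(cleanse(rca))
print(extract_keywords(rca, top_n=1))
```

For the sample RCA, cleansing drops "by", "on", and "the" along with the punctuation, and the single top keyword is "table" because it appears twice. The cleansed tokens or keywords would then feed whichever similarity measure is used for comparison against the Knowledge Repository.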
Learn More
- Hungry for more AIOps stories? Check out how AIOps slashes thousands of manual hours annually in this blog.
- Stay connected — join our Talent Community!
- Check out our Technology and Product teams to learn how you can get involved.