Written by O’Reilly Media
Present-day information systems have become so complex that troubleshooting them requires real-time performance data presented at fine granularity and a thorough understanding of data interpretation. This complexity highlights the need to avoid alert fatigue, where operators are overwhelmed by excessive alert notifications. The era when failures could be traced to a few causes is long gone. Industry availability standards remain high and continue to rise, so network management solutions must evolve to provide real-time monitoring and actionable alerts. Systems must be equipped with powerful instrumentation, as lack of information leads to lost time and, in some cases, lost revenue. Modern cloud platform capabilities and continuous monitoring further enhance real-time insights, enabling faster responses to critical issues.
Monitoring empowers operators to catch complications before they become problems, helping you maintain high availability and deliver quality service. By setting accurate alert thresholds, organizations can minimize false positives that unnecessarily distract operators. It also helps inform decisions about the present and future, serves as input to infrastructure automation, and is an indispensable learning tool. Proper network alert management reduces alert noise and ensures essential issues are promptly addressed, preventing performance issues that could lead to downtime.
Monitoring, Alerting, and What They Can Do For You
Monitoring has become an umbrella term whose meaning depends on context. Most broadly, it refers to the process of becoming aware of a system’s state, whether dealing with critical alerts about disk space or performance monitoring for resource usage. This is done in two ways: proactive and reactive. Proactive monitoring involves watching visual indicators, such as time series and dashboards, and is often used to detect anomaly signatures early. Reactive monitoring involves automated notifications to operators about significant changes in system state and is usually referred to as alerting. Together, monitoring and alerting enable quick action on potential critical issues.
Ambiguity still exists. On forums and mailing lists, some use the term monitoring to refer to measurement processes that may not involve human interaction. The point is, when reading about effective monitoring, it is useful to discern as early as possible what process is being discussed. This helps businesses build robust monitoring and alerting systems that reduce alert fatigue and ensure operator attention is focused on actionable alerts rather than low-priority notifications.
Some goals of monitoring are more obvious than others. To demonstrate its full potential, here are the most common use cases, connected to overseeing data flow and change in your system. In many organizations, network monitoring alerts play a key role in identifying problems with network performance and server monitoring. These alerts, triggered by network conditions, can be refined with dynamic thresholds to filter out transient spikes that do not represent real threats.
Defining Monitoring and Alerting
Because there are many ways to view these activities, here are formal definitions to help put each activity in context. When done right, monitoring and alert management help reduce alert noise and ensure better responsiveness to critical issues.
Monitoring is the process of maintaining surveillance over the existence and magnitude of state changes and data flow in a system. Its aim is to identify faults and assist in their elimination. Techniques used in monitoring information systems intersect the fields of real-time processing, statistics, and data analysis. A set of software components used for data collection, processing, and presentation is called a monitoring system.
Alerting is the capability of a monitoring system to detect and notify operators about meaningful events that denote a significant change of state. The notification is referred to as an alert and may take multiple forms: email, SMS, instant message, or phone call. The alert is sent to the appropriate recipient responsible for addressing the event and is often logged as a ticket in an Issue Tracking System (ITS). Effective alerting relies on well-tuned thresholds and historical data analysis to reduce alert fatigue and focus attention on genuine emergencies.
Early Problem Detection
Speedy detection of threatening issues is the most important objective of monitoring and is the function of alerting. Dynamic thresholds help differentiate normal fluctuations from critical incidents, lessening the burden of constant alerts. The challenge is pursuing two conflicting goals: speed and accuracy. You want to know when something is not right and you want to know quickly, but you do not want to be alarmed by temporary blips or transient issues of negligible impact. By striking the right balance, you can reduce alert noise from short-lived anomalies. Behind every reasonable threshold value lurks a risk for potentially disastrous issues slipping under the radar. This is why setting up alarms manually is difficult and speculating about the right threshold levels in meetings can be exhausting and unproductive. Incorporating machine learning for anomaly detection can help tune alarm levels, making monitoring alerts more accurate. The goal of effective alerting is to minimize hazards.
Availability
In the business of availability, downtime is a dreaded term. It happens when the system loses full availability; the loss may be complete or partial, or may affect only some users. The key is early detection and prevention in busy production environments, especially using network monitoring alerts that illuminate potential bottlenecks in real time.
Downtime usually translates directly to revenue loss. A complete monitoring setup that allows for timely identification of issues is indispensable. With continuous monitoring, operators can promptly address critical alerts, maintain server monitoring best practices, and control performance issues. Ideally, monitoring tools should enable operators to drill down from a high-level overview to fine levels of detail, granular enough to point at specifics used in analysis and identification of root causes.
The root cause establishes the real reason (and its many possible factors) behind the fault. Corrective action builds upon findings from root cause analysis and is carried out to prevent future occurrences. Fixing only the superficial problem guarantees recurrence of the same faults in the long run. By leveraging robust network alert management and performance monitoring data, operators can make well-informed decisions and ensure consistent uptime.
Performance
Paying close attention to anomalous behavior in the system helps detect resource saturation and rare defects, such as spikes in disk space usage or unusual response times. Some faults get past Quality Assurance (QA), are hard to account for, and are likely to surface only after long hours of regression testing. A peculiar group of rare bugs emerges exclusively at large scale under heavy system load. Although hard to isolate in test environments, they are consistently reproducible in production. Once located through scrupulous monitoring, they are easier to identify and eliminate.
By examining network performance regularly and setting relevant alert thresholds, organizations can preemptively tackle performance bottlenecks before they become full-blown crises. Metrics monitoring and alerting can highlight early warning signs of resource depletion or connectivity trouble, ensuring incident management processes can respond quickly.
Decision Making
Operators develop a strong intuition about shifts in utilization patterns. The ability to discern anomalies from visual plots is a key part of their expertise. Sometimes operators must make decisions quickly, and in critical situations, knowing your system well can reduce errors and improve your chances of successful mitigation. Other times, intuition leads to unfounded assumptions, which can result in catastrophic outcomes. Comprehensive monitoring helps verify guesses and gut feelings, and robust alert management filters out harmless fluctuations, focusing you on critical issues.
Baselining
Monitoring provides immediate insight into a system’s current state. This data often takes quantitative form and, when recorded as timeseries, becomes a rich source of information for baselining. Through historical data, you can spot normal metric ranges and identify anomalies that point to performance issues or potential failures in server monitoring environments.
Establishing standard performance levels is essential. It applies to capacity planning, leads to formulation of data-backed Service-Level Agreements (SLAs), and, where inconsistencies are detected, can be a starting point for in-depth performance analysis. Dynamic thresholds and anomaly detection rely heavily on this baseline for more accurate alert notifications.
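As a minimal sketch of how a baseline can feed alert limits, the example below derives a normal range from historical data points and checks new values against it. The choice of mean plus or minus three standard deviations, the function names, and the sample values are all assumptions made for illustration, not a method prescribed here.

```python
from statistics import mean, stdev

def baseline_limits(history, width=3.0):
    """Return (lower, upper) limits as mean +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread

def is_anomalous(value, limits):
    """True when a data point falls outside the baseline range."""
    lower, upper = limits
    return value < lower or value > upper

# Hypothetical historical data points, e.g. requests per second.
history = [100, 102, 98, 101, 99, 100, 103, 97]
limits = baseline_limits(history)

within_range = is_anomalous(105, limits)   # inside the baseline range
outside_range = is_anomalous(110, limits)  # outside the baseline range
```

Recomputing the limits whenever the baseline is reevaluated keeps the range aligned with the system's current behavior.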
Predictions
In monitoring, a prediction is a quantitative forecast containing a degree of uncertainty about future levels of resources or phenomena leading to their utilization. Monitoring traffic and usage patterns over time serves as a source of information for decision support. With strategic monitoring alerts, you can quickly see if usage patterns deviate from expected amounts, potentially indicating a system-wide issue or an external factor. It helps you predict normal traffic levels during peaks and troughs, holidays, and key periods such as major global sporting events. When usage patterns trend outside projected limits, there is usually a good reason for it, even if that reason is not directly related to system operation. For instance, traffic patterns that drop below 20% of expected values for an extended period might stem from customers experiencing difficulties with their ISPs. Some Internet giants can conclusively narrow down the source of external failure and proactively help ISPs identify and mitigate faults.
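The traffic example above can be sketched as follows. The forecast here is a simple per-slot average over earlier periods, and the 20% fraction comes from the text; the function names, the three-interval minimum, and the traffic numbers are assumptions for illustration only.

```python
from statistics import mean

def seasonal_forecast(previous_periods):
    """Forecast each time slot as the mean of the same slot in earlier periods."""
    return [mean(slot) for slot in zip(*previous_periods)]

def sustained_drop(observed, expected, fraction=0.2, min_points=3):
    """True if observed stays below fraction * expected for min_points
    consecutive intervals."""
    run = 0
    for actual, forecast in zip(observed, expected):
        if actual < fraction * forecast:
            run += 1
            if run >= min_points:
                return True
        else:
            run = 0  # a single recovered interval ends the run
    return False

# Hypothetical traffic for the same four time slots in three earlier weeks.
previous_weeks = [
    [1000, 1200, 1500, 1400],
    [1100, 1300, 1500, 1600],
    [900, 1100, 1500, 1500],
]
expected = seasonal_forecast(previous_weeks)

outage_detected = sustained_drop([950, 230, 250, 200], expected)
blip_detected = sustained_drop([950, 230, 1490, 1510], expected)
```

Requiring several consecutive intervals keeps a single low reading from being treated as an extended drop.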
Beyond predicting future workload, close interaction with monitoring may help predict business trends. Customers have different needs at different times of the year. The ability to predict demand and match it to seasonality translates directly into revenue gains. Monitoring these fluctuations also helps reduce alert fatigue by allowing you to refine alerts triggered under certain load thresholds.
Automation
Metrics are a source of quantitative information, and evaluation of an alarm state results in a yes-no answer to the simple question: is the monitored value within expected limits? This is essential in real-time monitoring setups where performance monitoring decisions must be made quickly. This has important implications for automation, especially in processes involving admission control, pause of operation, and estimations based on real-time data. Leveraging these metrics for automated responses is a cornerstone of effective monitoring in modern infrastructures.
Admission Control
Bursts of input may saturate a system’s capacity and require dropping some traffic. To prevent a uniformly bad experience for all users, a portion of inputs may be rejected—this is known as admission control. Its objective is to defend against thrashing that severely degrades performance. In such scenarios, alerting systems can detect spikes quickly, and priority alerts can help operators intervene when necessary.
Some implementations of admission control are known as the Big Red Button (BRB), requiring a human engineer to intervene and press it. Deciding when to stop admission is inherently inefficient: such decisions are usually made too late, often require approval or sign-off, and there is always the danger of someone forgetting to reset the button when the situation is back under control.
Consider using monitoring inputs for admission control. When metrics monitoring and alerting are in place, operators can respond more confidently. Monitoring-enabled mechanisms go into effect immediately when problems are detected, allowing for gradual and local degradation before sudden, global disasters. When the problem subsides, the protection mechanism stops without human supervision.
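A monitoring-driven admission control loop might look like the sketch below. It assumes utilization is reported as offered load relative to capacity (1.0 meaning exactly at capacity) and sheds only the share of requests needed to return to the limit. The class name, the proportional shedding rule, and the numbers are hypothetical.

```python
import random

class AdmissionController:
    """Rejects a share of inputs while a monitored value is over its limit."""

    def __init__(self, limit, rng=None):
        self.limit = limit              # utilization above which shedding starts
        self.reject_fraction = 0.0      # share of inputs currently rejected
        self.rng = rng or random.Random()

    def update(self, utilization):
        """Recompute the rejected share from the latest monitored data point."""
        if utilization <= self.limit:
            # Problem subsided: protection stops without human supervision.
            self.reject_fraction = 0.0
        else:
            # Shed just enough load to bring utilization back to the limit.
            self.reject_fraction = 1.0 - self.limit / utilization

    def admit(self):
        """Decide whether to accept one incoming request."""
        return self.rng.random() >= self.reject_fraction

controller = AdmissionController(limit=1.0, rng=random.Random(0))
controller.update(1.25)   # demand at 125% of capacity: about 20% is rejected
admitted = sum(controller.admit() for _ in range(10_000))
controller.update(0.8)    # load back under the limit: everything is admitted
```

Because the rejected share scales with the overload, degradation is gradual rather than all or nothing, and no one has to remember to reset a button.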
Autonomic Computing
Monitoring’s feedback loop is also central to Autonomic Computing (AC), an architecture in which the system regulates itself, enabling self-management and self-healing. AC was inspired by the human central nervous system, drawing an analogy with complex, distributed information systems. Unconscious processes, such as control over breathing rate, do not require human effort. The goal of AC is to minimize human intervention by replacing it with self-regulation. Comprehensive monitoring can help achieve this by automating detection of critical alerts and immediate response.
Monitoring and Alerting in a Nutshell
Having discussed their purpose, let’s move on to how these processes are done. Monitoring is a continuous process, a series of steps carried out in a loop. This section outlines its workings and introduces monitoring’s fundamental building blocks. Integrating robust server monitoring with effective techniques ensures critical issues are rapidly communicated to the right team members, reducing downtime and alert fatigue.
Metrics and Timeseries
Watching and evaluating timeseries—chronologically ordered lists of data points—is at the core of monitoring and alerting. These time series provide real-time insights by displaying dynamic changes in network performance or server disk space usage, enabling real-time monitoring and anomaly detection.
Monitoring consists of recording and analyzing quantitative inputs—numeric measurements carrying information about current state and changes. Each data input has properties describing its origin, units, and time of sampling.
Inputs and their properties are stored as metrics, which are data structures optimized for storage and retrieval of numeric inputs. The resulting collection of inputs may be interpreted in many ways based on their assigned properties, allowing tools to evaluate inputs as a whole and at many levels of granularity.
Data inputs extracted from selected metrics are grouped by measurement time. Groups are assigned to uniform intervals on a time axis, and the total inputs in each group can be summarized by a mathematical transformation, called a summary statistic. This yields one numeric data point for each time interval. The collection of data points—a timeseries—describes a statistical aspect of all inputs from a given time range. The same set of data inputs may be used to generate different data points, depending on the summary statistic chosen. Well-structured metrics monitoring and alerting allow you to quickly see trends in real-time data and identify potential bottlenecks.
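The grouping and summarizing described above can be sketched in a few lines. The example assumes each data input is a (timestamp, value) pair with timestamps in seconds; the function name and the latency values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def to_timeseries(inputs, interval, statistic):
    """Group (timestamp, value) inputs into uniform intervals and reduce each
    group to one data point using the given summary statistic."""
    groups = defaultdict(list)
    for timestamp, value in inputs:
        # Align each input to the start of the interval it falls into.
        interval_start = timestamp - (timestamp % interval)
        groups[interval_start].append(value)
    return [(start, statistic(values)) for start, values in sorted(groups.items())]

# Hypothetical latency inputs: (seconds since start, milliseconds).
inputs = [(0, 120), (20, 80), (59, 100), (60, 200), (95, 400), (130, 90)]

average_series = to_timeseries(inputs, 60, mean)
maximum_series = to_timeseries(inputs, 60, max)
```

The same inputs yield two different timeseries here, one per summary statistic, which is why it matters to know which statistic a plot or alarm is based on.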
Alarms, Alerts, and Monitors
An alarm is a configuration describing a system’s change in state, most typically a highly undesirable one, through fluctuations of data points in a timeseries. Alarms are made up of metric monitors and date-time evaluations, and may optionally nest other alarms. When properly configured, alarms can reduce alert noise and help manage critical issues more effectively.
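One way to picture this composition is the sketch below, in which an alarm combines conditions and can itself be used as a condition of another alarm. The `Alarm` class, the business-hours rule, and the dictionary standing in for metric monitor states are all hypothetical simplifications.

```python
from datetime import datetime, time

class Alarm:
    """Combines conditions; each condition is a callable taking the current time."""

    def __init__(self, conditions, combine=all):
        self.conditions = conditions
        self.combine = combine  # all: every condition must hold; any: one suffices

    def __call__(self, now):
        return self.combine(condition(now) for condition in self.conditions)

def business_hours(now):
    """Date-time evaluation: true between 09:00 and 17:00."""
    return time(9, 0) <= now.time() < time(17, 0)

# Stand-ins for metric monitor states; real monitors would update these.
monitor_states = {"cpu": "alert", "disk": "clear"}

def monitor_in_alert(name):
    """Condition that holds while the named monitor is in alert state."""
    return lambda now: monitor_states[name] == "alert"

resource_alarm = Alarm([monitor_in_alert("cpu"), monitor_in_alert("disk")], combine=any)
daytime_alarm = Alarm([resource_alarm, business_hours])  # nests resource_alarm

triggered_by_day = daytime_alarm(datetime(2024, 1, 8, 10, 30))
triggered_at_night = daytime_alarm(datetime(2024, 1, 8, 22, 0))
```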
An alert is a notification of a potential problem, which can take one or more forms: email, SMS, phone call, or ticket. An alert is issued by an alarm when the system transitions through a threshold, and this breach is detected by a monitor. For example, you may configure an alarm to alert you when the system exceeds 80% CPU utilization for a continuous period of 10 minutes. This is a static threshold; a dynamic threshold would instead derive the limit from the usual values of the timeseries.
A metric monitor is attached to a timeseries and evaluates it against a threshold. The threshold consists of limits on data point values and the duration of the breach, which can be expressed as a number of data points. When arriving data points fall below, exceed, or go outside the defined range for long enough, the threshold is breached and the monitor transitions from clear to alert state. When data points fall within the limits of the defined threshold, the monitor recovers and returns to clear state. Monitor states factor into alarm state evaluation.
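The clear and alert transitions can be sketched as a small state machine. This example assumes an upper limit only, one data point per minute, and a breach duration counted in consecutive data points; the class and attribute names are illustrative, not taken from any particular product.

```python
class MetricMonitor:
    """Evaluates data points against an upper limit and a breach duration."""

    def __init__(self, upper_limit, breach_points):
        self.upper_limit = upper_limit      # limit on data point values
        self.breach_points = breach_points  # consecutive breaching points required
        self.consecutive = 0
        self.state = "clear"

    def evaluate(self, value):
        """Feed the next data point and return the resulting state."""
        if value > self.upper_limit:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any point within limits ends the breach
        self.state = "alert" if self.consecutive >= self.breach_points else "clear"
        return self.state

# 80% CPU for 10 consecutive one-minute data points, as in the earlier example.
monitor = MetricMonitor(upper_limit=80, breach_points=10)

# Nine high points, a dip, ten high points, then recovery (hypothetical values).
cpu_points = [75] + [85] * 9 + [70] + [90] * 10 + [60]
states = [monitor.evaluate(point) for point in cpu_points]
```

The first run of nine high points never raises an alert because the dip resets the count; only the tenth consecutive point of the second run does, and the final point returns the monitor to clear.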
Monitoring System
A monitoring system is a set of software components that performs measurements and collects, stores, and interprets monitored data. The system is optimized for efficient storage and prompt retrieval of monitoring metrics for visual inspection of timeseries and data point analysis for alerting. Such systems can encompass many forms of network management, including network monitoring alerts and server resources, ensuring a holistic view of performance and availability.
Many vendors have taken up the challenge of designing and implementing monitoring systems. Numerous open source products are available, and more cloud vendors offer monitoring and alerting as a service. Listing them here makes little sense as the list is very dynamic. Instead, refer to the Wikipedia article on comparing network monitoring systems, which compares about 60 systems and classifies each in around 17 categories based on supported features, operation mode, and licensing.
It’s good to ask the following questions when selecting a monitoring product:
- What are the fees and restrictions imposed by the product’s license?
- Was the solution designed with reliability and resilience in mind? If not, how much effort will go into monitoring the monitoring platform itself?
- Is it capable of juxtaposing timeseries from arbitrary metrics on the same plot as needed?
- Does it produce timeseries plots of fine enough granularity?
- Does its alerting platform empower experienced users to create sophisticated alarms?
- Does it offer API access to export gathered data for offline analysis?
- How difficult is it to scale as your system expands?
- How easily can you migrate from it to another monitoring or alerting solution?
Most monitoring systems share a similar high-level architecture and operate on similar principles. Agents gather and submit inputs to the monitoring system through its specialized write-only interface. The system stores data inputs in metrics and may submit fresh data points for evaluation of threshold breach conditions. When a breach is detected, an alert may be sent to notify the operator about the fault. The operator analyzes timeseries plots and draws conclusions that lead to mitigative action. Generally, the process is broken down into three functional parts:
1. Data Collection
Data about system operations is collected by agents from servers, databases, and network equipment. Sources include logs, device statistics, and system measurements. Collection agents group inputs into metrics and assign properties that serve as an address in space and time. Inputs are submitted to the monitoring system through an agreed protocol and stored in the metrics database. This helps minimize alert fatigue by pinpointing genuine performance issues.
2. Data Aggregation and Storage
Incoming data inputs are grouped and collated by their properties and stored in their respective metrics. Data inputs are retrieved from metrics and summarized by a summary statistic to yield a timeseries. Resulting timeseries data points are submitted one by one to an alarm evaluation engine and checked for anomalous conditions. When such conditions are detected, an alarm goes off and dispatches an alert to the operator. Real-time monitoring updates these processes almost immediately, ensuring actionable alerts and efficient management.
3. Presentation
The operator may generate timeseries plots for an overview of the current state or in response to receiving an alert. When a fault is identified and a corrective action is taken, the graphs should give immediate feedback and reflect how the corrective action helped. If no improvement is observed, further intervention may be necessary. Real-time insights allow operators to quickly assess whether changes have rectified performance issues.
A monitoring system provides a point of reference for all operators. Its benefits are most pronounced in mature organizations where infrastructure teams, systems engineering, application developers, and operations can interact freely, exchange observations, and reassign responsibilities. Having a single point of reference for all teams significantly boosts detection and mitigation efficacy. Network alert management becomes simpler, as every operator works from the same set of data. This also helps reduce alert fatigue through unified coordination in filtering out unimportant or repetitive alerts.
The Process of Alerting
Human operators play a central role in system monitoring. The process starts with establishing the system’s baseline—gathering information about performance levels and system behavior under normal conditions. This serves as a starting point for creating an initial alerting configuration, which defines abnormal conditions by setting thresholds for exceptional metric values. With dynamic thresholds, it becomes much easier to tune these benchmarks and avoid frequent false positives.
Ideally, alarms should generate alerts only in response to actual defects that affect normal system operation. Unfortunately, that’s not always the case.
When thresholds are set too liberally, legitimate problems may not be detected in time and the system risks performance degradation, which may lead to downtime. When problems are discovered and mitigated, the alerting configuration should be tightened to prevent recurrence of costly outages.
Alarm monitors can also be created with overly sensitive thresholds, leading to a high likelihood of alarms being triggered by normal system operation. In such scenarios, alarms generate alerts when no harm is done. The baseline should then be reevaluated and monitors adjusted to improve detection of real issues.
Most alarms, however, go off for valid reasons and identify faults that can be mitigated. When that happens, an operator investigates the problem, starting with the metric that triggered the threshold breach and reasoning backward to find the cause. When a satisfactory explanation is found and corrective measures are taken, the metrics reflect that and the alarm transitions back to clear state. If metrics do not reveal improvement, that raises questions about the effectiveness of the mitigation and alternative action may be needed.
After a successful recovery, system metrics might improve enough to warrant another baseline recalculation and adjustment of the alarm configuration. Through these iterative improvements, organizations can systematically reduce alert noise and ensure that priority alerts gain focused attention.
Issue Tracking
An Issue Tracking System (ITS) is a database of reported problems recorded as tickets. It facilitates prioritization and tracking of reported problems, as well as efficient collaboration between individuals and teams. Alerts often take the form of tickets, making their role in prioritization and event response highly relevant to alerting.
Tickets and Queues
A ticket describes a problem with a chronological record of actions taken to resolve it.
Tickets are a convenient mechanism for prioritizing incoming issues and enabling collaboration between team members. They may be filed by humans or generated by automated processes, such as alarms attached to metric monitors. Either way, they are indispensable in resolving problems and serve as a central point of reference for all parties involved. New information is appended to the ticket through updates, with the most recent update reflecting the latest state. When a solution is found and applied, the ticket is archived and its state changes from “open” to “resolved.”
Each ticket has a title outlining symptoms of the reported problem, a detailed description, and an assigned severity level—typically urgent, high, normal, low, or trivial. Tickets also have miscellaneous properties, such as information about the requester and timestamps for creation and modification, used in reporting.
Operators are expected to work on tickets in order of priority, from most to least severe. To assist, tickets are placed in priority queues. Each queue is a database query returning a list of ticket entries sorted by predefined criteria, most commonly by priority in descending order and, within each priority level, by date from oldest to newest.
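The most common queue ordering can be sketched as a sort over open tickets. A real ITS would run this as a database query; the `Ticket` fields, the severity ranking, and the sample tickets below are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

# Lower rank means more severe; levels follow the list given earlier.
SEVERITY_RANK = {"urgent": 0, "high": 1, "normal": 2, "low": 3, "trivial": 4}

@dataclass
class Ticket:
    title: str
    severity: str
    created: datetime
    state: str = "open"

def priority_queue(tickets):
    """Open tickets, most severe first; oldest first within a severity level."""
    open_tickets = [t for t in tickets if t.state == "open"]
    return sorted(open_tickets, key=lambda t: (SEVERITY_RANK[t.severity], t.created))

tickets = [
    Ticket("Slow dashboard", "low", datetime(2024, 1, 8, 9, 0)),
    Ticket("Disk almost full", "high", datetime(2024, 1, 8, 11, 0)),
    Ticket("Checkout failing", "urgent", datetime(2024, 1, 8, 12, 0)),
    Ticket("Replica lagging", "high", datetime(2024, 1, 8, 10, 0)),
    Ticket("Old cert warning", "urgent", datetime(2024, 1, 7, 8, 0), state="resolved"),
]

queue_titles = [t.title for t in priority_queue(tickets)]
```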
Depending on the organization’s structure and size, an ITS may host one to hundreds of ticket queues. Tickets are reassigned between queues to signal transfer of responsibility for issue resolution. A team may own several queues, each for a separate category of tickets.
Tickets resolved over time create a body of knowledge with valuable information about system problems, sources, solutions for mitigation, and the quality of work carried out by operators.
The Challenges
Effective monitoring requires conscious, ongoing effort. It is not a trivial process and has many facets. Sometimes priorities must be balanced. An ad hoc approach tends to cost more effort over time, while good preparation makes monitoring easier. The following factors make monitoring difficult. In field service management, this also extends to managing alerts across multiple client sites and diverse equipment, making a well-structured monitoring and alerting strategy crucial.
Baselining
The challenge with baselines is not establishing them, but their volatility. Few areas exemplify “nothing endures but change” more than information systems. Hardware gets faster, software has fewer bugs, infrastructure becomes more reliable. Sometimes architects trade off one resource for another, other times they give up a feature to focus on core functionality. The implication for monitoring and alerting is that alarms can quickly become stale and their maintenance adds to operational burden. Integrating historical data with real-time insights helps you adapt alert thresholds promptly.
Coverage
Full monitoring coverage should follow a system’s expansion and structural changes, but often it does not. More commonly, configurations are set up at the start and revisited only when necessary, or—worse—when they are so out of date that real problems are noticed by end users. Maintaining full monitoring coverage, essential for detecting problems, is often neglected until it’s too late. By employing performance monitoring for network performance and implementing server monitoring across all endpoints, you minimize the risk of missing critical alerts.
Manageability
Large monitoring configurations include tens of thousands of metrics and thousands of alarms. Complex setups are expensive to maintain and prone to human misinterpretation and oversight. Without a systematic approach and rich instrumentation, configurations become increasingly inconsistent and hard to manage. Introducing machine learning to refine alert thresholds can assist in managing alerts effectively by reducing extraneous triggers.
Accuracy
Sometimes faults remain undetected, while other times alarms go off despite no immediate or eventual danger of noticeable impact. Reducing both kinds of errors is a constant challenge, often requiring decisions that might seem counterintuitive. Leveraging dynamic thresholds and anomaly detection can significantly reduce alert noise and false positives, making real-time monitoring more accurate and valuable.
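To make the two kinds of errors measurable, one option is to compare alarm states with known fault periods, interval by interval. The sketch below uses the standard precision and recall ratios; the function name and the sample data are hypothetical, and it assumes you have a record of when real faults occurred.

```python
def alarm_accuracy(alerted, faulty):
    """Compare per-interval alarm states with known fault intervals."""
    pairs = list(zip(alerted, faulty))
    true_alerts = sum(a and f for a, f in pairs)
    false_alarms = sum(a and not f for a, f in pairs)   # alert, but no fault
    missed_faults = sum(f and not a for a, f in pairs)  # fault, but no alert
    raised = true_alerts + false_alarms
    actual = true_alerts + missed_faults
    return {
        "false_alarms": false_alarms,
        "missed_faults": missed_faults,
        # Share of raised alerts that pointed at a real fault.
        "precision": true_alerts / raised if raised else 1.0,
        # Share of real faults that raised an alert.
        "recall": true_alerts / actual if actual else 1.0,
    }

# One entry per time interval (hypothetical history).
alerted = [False, True, True, False, False, True, True, False]
faulty = [False, True, False, False, True, True, True, False]

report = alarm_accuracy(alerted, faulty)
```

Tightening a threshold usually raises recall at the cost of precision, and loosening it does the reverse, so tracking both numbers shows which way a change moved the trade-off.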
Context
Monitoring’s main objective is to identify and pinpoint the source of problems promptly. Time is too precious for in-depth analysis. For complex data to be presented efficiently, large sets of numbers must be reduced to single values or classified into buckets. Observers must make accurate assumptions based on a thorough understanding of the underlying data, collection method, and source. Reviewing metrics over historical data makes it easier to distinguish genuine issues from minor fluctuations.
Human Nature
In striving for results, humans often see what they want rather than what’s actually there. Important information may be discarded as outliers or as negligible. Operators get away with neglecting outliers most of the time, but on rare occasions, especially at large scale, neglecting these outliers may result in high-visibility outages. Humans are poor intuitive statisticians, prone to setting round thresholds and losing sense of proportion. Monitoring systems enhanced with metrics monitoring and alerting can minimize human error by suggesting data-driven thresholds that better reflect real-time conditions.
Important Terms
Monitoring vocabulary is used inconsistently. Many organizations, especially those with established cultures, use specific monitoring terms interchangeably. Here is a short glossary of the most important terms. In modern field service management systems, these definitions ensure stakeholders share an understanding of how alert notifications are defined and handled.
Agent
A software process that continuously records data inputs and reports them to a monitoring system. Agents may run on various servers or devices, offering granular data to assess performance issues.
Alarm
A configuration describing an undesirable condition and alerts issued in response. Well-designed alarms help reduce alert fatigue by focusing on genuine critical issues.
Alert
A notification message about a change of state, typically signifying a potential problem. Alerts triggered by time series spikes or dynamic threshold breaches can be sent via phone, email, or text message.
Alerting
The process of configuring alarms and alerts. Effective alerting helps prioritize urgent situations, ensuring the right people get the right information at the right time.
Data Input
A numeric value with a set of properties gathered at the source of the measurement by a monitoring agent.
Data Point
A numeric value summarizing one or multiple data inputs reported in a defined time interval. A series of data points makes up a timeseries.
Metric
A collection of data inputs described by a set of properties. Timeseries are often mistakenly referred to as metrics. Monitoring metrics should not be confused with performance metrics, which are high-level business performance indicators.
Monitor
A process evaluating the most recent data points on a timeseries for threshold fit. This is an integral part of an alarm. Monitors keep a vigilant eye on server monitoring, network alert conditions, and more.
Monitoring
The process of collecting and retrieving relevant data describing a change of state. Monitoring alerts can be configured for everything from network performance to disk space usage in real time.
Timeseries
A list of data points sorted in natural temporal order, most commonly presented on a plot. Timeseries analysis forms the backbone of real-time monitoring and anomaly detection.