This post is part of the Observability 101 series. The series covers the basic ideas behind monitoring and observability and, I hope, will help engineers make their products better prepared for a long life in production.
The cheapest and easiest way to express a change in the internal state of an application is to measure and report the change as a number. A sequence of numbers, accompanied by timestamps that point to the moments when each measurement was made, forms a time series. A time series plus a meaningful name makes a metric.
Now you know what a metric is. Next, I’ll explain why metrics are useful and touch on a few other nuances.
Why are metrics important?
In the post “What is observability?” I mentioned that the main idea of observability is to give operators an idea of how a system is running at the current moment. With this information, operators can detect problems before they cause a catastrophic failure of the system.
There are different ways to inform operators about potential problems. For example, let’s tell them that Total CPU utilization is 50%:
- Metric: cpu.utilization.total 50 1527075001
- Log message: [2018-05-23 11:30:01 UTC] INFO: Current Total CPU utilization is 50%
As you can see, the log line has more information than the metric. It’s better structured, and even the date is in a human-readable format. Does this help the operator solve the problem? The answer is yes, but only if this message is unique and tells the operator what is going on and what to do just from reading it. It’s quite possible that after reading this log line the operator will ask a legitimate question: “So what? Is 50% good or bad? What was CPU utilization a minute ago, yesterday at the same time, a month ago at the same time, or on the 4th Wednesday of the first month of the last quarter?”
Can these questions be answered using those same log lines? Yes, they can, but only if we keep logs for long enough. So why do we need metrics? For two simple reasons:
- Metrics are structured by default. This structure makes search easier. Logs, in contrast, are free text, and every line needs to be parsed to extract facts (see the sketch after this list). The format of the lines might change over time with configuration or code changes, meaning several parsers are needed to extract the same fact from log lines written some time apart.
- Metrics need dramatically less space to store a data point than a log line. For example, the authors of prometheus.io claim that Prometheus needs, on average, 1.37 bytes to store a single data point. With the same storage we can keep more metrics for longer than we could keep log lines. Keeping more metrics for longer gives operators a better understanding of how the system works and lets them identify longer trends.
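To make the first point concrete: the metric line above can be decomposed with a simple split on whitespace, while the log line needs a format-specific parser. Here is a minimal sketch in Python (the regular expression is tied to this exact message wording, which is precisely the problem):

```python
import re
from datetime import datetime, timezone

log_line = "[2018-05-23 11:30:01 UTC] INFO: Current Total CPU utilization is 50%"

# The parser works only for this exact wording and layout; a reworded
# message or a new field means another parser has to be written.
pattern = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\] \w+: "
    r"Current Total CPU utilization is (?P<value>\d+)%"
)
match = pattern.match(log_line)
if match:
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Reconstruct the same fact the metric expressed directly.
    print("cpu.utilization.total", float(match.group("value")), int(ts.timestamp()))
```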
To summarize this section: metrics aren’t the only way to store information about the internal state of the system. However, they are the most efficient way to store it.
What are additional attributes of metrics?
Let’s go back to the example metric from the previous section: cpu.utilization.total 50 1527075001. What does this example consist of?
- cpu.utilization.total — name of the metric
- 50 — value of the measurement
- 1527075001 — timestamp of the moment when the measurement happened
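As a minimal sketch (Python; the class and field names are purely illustrative), the same line can be modeled as a small typed record:

```python
from typing import NamedTuple

class DataPoint(NamedTuple):
    name: str       # what was measured, e.g. "cpu.utilization.total"
    value: float    # the result of the measurement
    timestamp: int  # Unix time of the moment the measurement was made

def parse(line: str) -> DataPoint:
    name, value, timestamp = line.split()
    return DataPoint(name, float(value), int(timestamp))

point = parse("cpu.utilization.total 50 1527075001")
print(point.name, point.value, point.timestamp)
```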
What else is important?
- Coordinates of where the metric came from. If it is a unique metric and it comes from a single producer, then cpu.utilization.total is all we need to know about it. But if there are several producers of a metric (for example, every server reports the metric for total CPU utilization by all processes on the host, and every application reports the same metric but only for the CPU utilized by that application), then we need a way to segregate data points by producer. To do that, we add host and producer attributes to the metric. Two pairs of attributes identify two independent producers of the metric: host=hostA, producer=host and host=hostA, producer=appA. With these attributes operators can compare CPU utilization by instances of the application appA running on different hosts, or get the ratio of CPU utilized by the application appA to total CPU utilization on the host (see the sketch after this list).
- Baseline. A single data point tells us the result of the latest measurement, and a time series tells us what results were observed earlier. A metric doesn’t tell us what value or range is expected and normal. A normal value can be defined either from historical data (what were the results of the measurements when we believe the system ran “as expected”?) or from external requirements and common sense (we don’t want any 5XX HTTP errors from a service, or the frame rate should be at least 24 frames per second).
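Here is a minimal sketch (Python, with invented values) of how the host and producer attributes let an operator slice the same metric: comparing appA across hosts, or computing appA’s share of the total CPU on one host:

```python
# Data points for the same metric name, distinguished only by attributes.
points = [
    {"name": "cpu.utilization.total", "value": 50, "host": "hostA", "producer": "host"},
    {"name": "cpu.utilization.total", "value": 20, "host": "hostA", "producer": "appA"},
    {"name": "cpu.utilization.total", "value": 70, "host": "hostB", "producer": "host"},
    {"name": "cpu.utilization.total", "value": 35, "host": "hostB", "producer": "appA"},
]

def value(host, producer):
    # Pick the data point reported by a specific producer on a specific host.
    return next(p["value"] for p in points
                if p["host"] == host and p["producer"] == producer)

# Compare instances of appA running on different hosts ...
print("appA:", value("hostA", "appA"), "vs", value("hostB", "appA"))
# ... or get the ratio of CPU used by appA to total CPU used on hostA.
print("appA share on hostA:", value("hostA", "appA") / value("hostA", "host"))
```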
What should you measure?
Deciding what to measure is extremely important. James Turnbull gave a talk in 2015 at the Velocity conference in New York. His presentation was called “Monitoring as a Service,” and it contained a slide that was eye-opening, for me at least. The slide said: “Focus on business outcomes.” It means that business-related metrics are the most important, and I agree with him. I see metrics as another type of information teams use to describe the product. Key domain-specific metrics help build understanding. It’s great if the monitoring system shows disk buffer usage, but it’s useless if it can’t show the number of accepted, processed, and rejected orders for an order processing system.
There is a simple rule to help you decide on the priority of measurement implementation: start with metrics about the business domain of your product. Next, add measurements to cover the implementation details of your application; if there’s an important queue in the application, measure how many messages are injected into and consumed from the queue per minute (see the sketch below). Roughly the same priority goes to measurements of how your application communicates with other services: how many requests are sent to a database and what the response time is. Last on the list are measurements about the infrastructure layer: disk usage, CPU utilization, and bytes sent/received over the network.
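For the queue example, the instrumentation can be as small as two counters around the queue operations. A minimal sketch, assuming an in-process queue (the class and metric names are illustrative):

```python
import queue

class InstrumentedQueue:
    """Wrap an in-process queue and count how many messages were injected
    and consumed, so the counters can be reported (e.g. once a minute)
    as queue.injected and queue.consumed metrics."""

    def __init__(self):
        self._queue = queue.Queue()
        self.injected = 0   # total messages put on the queue
        self.consumed = 0   # total messages taken off the queue

    def put(self, message):
        self._queue.put(message)
        self.injected += 1

    def get(self):
        message = self._queue.get()
        self.consumed += 1
        return message

q = InstrumentedQueue()
q.put("order-1")
q.put("order-2")
q.get()
print("queue.injected", q.injected, "queue.consumed", q.consumed)
```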
It’s very easy to begin with infrastructure metrics because they are usually already there. The problem is that these metrics don’t tell you anything about the stability and performance of your application.
Metrics with fixed intervals vs sparse metrics
There are two kinds of events from the perspective of continuity: continuous events and sparse events. Accordingly, the results of measuring events can be continuous or sparse. An example of a metric for continuous events is CPU utilization: a program needs CPU resources all the time while it’s running, and the measurement will always have something to report. An example of a metric for sparse events is an error counter (hopefully errors in your system aren’t continuous events).
There’s no question about how to handle metrics for continuous events: measure and report. The real question is what to do with metrics for sparse events. A collector agent asks for the result of the measurement again and again at some fixed interval. What should the system report back if there were no events between the requests?
There are three possible options: report nothing, report some default value (like 0), or report the last given value if the metric is a counter. None of these options is wrong, but whichever you choose should be applied globally to all sparse metrics in the system, or even in the company. The reason is simple: the more standardized the monitoring system is, the easier it is for engineers to use data from the system.
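A minimal sketch of the three conventions, written as a single reporting function (the policy names are made up for illustration):

```python
def report(counter_value, previous_value, policy):
    """Decide what to hand back to the collector for a sparse metric.
    counter_value is None when no events happened since the last poll."""
    if counter_value is not None:
        return counter_value
    if policy == "nothing":
        return None              # option 1: emit no data point for this interval
    if policy == "zero":
        return 0                 # option 2: emit an explicit default value
    if policy == "last":
        return previous_value    # option 3: repeat the last known counter value

for policy in ("nothing", "zero", "last"):
    print(policy, "->", report(None, previous_value=7, policy=policy))
```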
Long names vs short names and tags
As mentioned earlier, it’s very useful to attribute metrics with information about the producers of the measurements. This way consumers of the metrics can later group them by host, by application, or by any other attribute.
There are two common ways to pass the attributes: as part of the metric name or as additional tags.
Attributes as part of a metric name
Keeping all information in the metric name is the easiest approach, and it works fine for smaller setups. For example, if there is a metric cpu.usage.total from application appA on host hostB, the final name of the metric can be cpu.usage.total.hostB.appA or hostB.appA.cpu.usage.total. It’s up to the team to decide which option is better. The con of this solution is that it isn’t flexible. If the team needs to add a data center to the schema, all metrics will get new names, and there’s a high chance existing data stays stored under the older names and becomes unavailable under the new naming convention.
Attributes as tags
Using tags is another approach to storing attributes. Metric names remain short, but a map of additional attributes is added. Using the tag system, the metric from the previous example is cpu.usage.total{host=hostB,app=appA}. This approach makes the system more flexible because attributes can easily be added and removed. The con is that implementing such a schema is a bit harder on both the application and storage sides.
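A minimal sketch of the two schemes applied to the same measurement (plain Python formatting; real systems encode tags in their own wire formats):

```python
# The same measurement, named two ways.
name, host, app = "cpu.usage.total", "hostB", "appA"

# 1. Attributes embedded in the name: simple, but adding a new attribute
#    (e.g. a data center) later means renaming every existing series.
long_name = f"{host}.{app}.{name}"            # "hostB.appA.cpu.usage.total"

# 2. Short name plus a map of tags: attributes can be added or removed
#    without breaking continuity of the existing series.
tagged = {"name": name, "tags": {"host": host, "app": app}}

print(long_name)
print(f'{tagged["name"]}{{host={tagged["tags"]["host"]},app={tagged["tags"]["app"]}}}')
```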
How does it work?
Now that we have a better understanding of metrics, it’s time to get some idea about how the system works end-to-end.
Instrumentation
At this phase, an application or the OS runs code to perform a measurement and puts the results into a buffer, either an in-memory data structure or a file. For example, the application can count the number of inbound requests, or the OS can count packets sent over a network interface.
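A minimal sketch of this phase, assuming an application that counts inbound requests in an in-memory buffer (names are illustrative):

```python
import threading

class RequestCounter:
    """An in-memory buffer holding the result of one measurement."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self):
        # Called from the request-handling path.
        with self._lock:
            self._count += 1

    def snapshot(self):
        # Read by the collection phase (an HTTP endpoint, a file writer, ...).
        with self._lock:
            return self._count

inbound_requests = RequestCounter()
inbound_requests.increment()
print("requests.inbound", inbound_requests.snapshot())
```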
Collection
A collector agent calls an application endpoint to get the results of the latest measurements, or it reads them from a file (e.g., from the /proc file system). Another option is that the application pushes the latest measurements to the collector agent.
These two approaches are called the push and pull models. Under the push model, a producer of metrics knows where the collector agent is and pushes data there. If the agent is unavailable at that moment, the producer decides what to do with the data: either drop it on the floor or keep it and retry later. The pro of the push model is that applications need to know only the minimum amount of information: essentially, the URL of the collector agent and what protocol to use. The con is that the collector agent can be flooded with metrics, since it has to accept all data pushed by producers; a load balancer is needed to spread the load between several collector agents.
With the pull model, producers keep data locally until the collector agent asks for it. The pros are that applications can provide a dynamic API to pull metrics with different levels of detail, and collector agents can balance the workload by taking a subset of producers from a shared list. The con of the pull model is that collector agents must know the URLs of all endpoints they should collect from, which can be tricky with dynamic infrastructure. Collector agents try to pull data at fixed intervals, but variations are possible, depending on the degree of parallelism they use, the number of producers, and the response time of each producer.
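As a minimal sketch of the pull model, an application can expose its latest measurements on an HTTP endpoint for a collector agent to scrape. The /metrics path and the plain-text "name value" format here are illustrative, not any particular product’s format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Latest measurements kept locally until a collector agent asks for them.
measurements = {"requests.inbound": 42, "errors.total": 1}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "\n".join(f"{name} {value}" for name, value in measurements.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # The collector agent scrapes http://localhost:8000/metrics on its own schedule.
    HTTPServer(("localhost", 8000), MetricsHandler).serve_forever()
```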
Transfer
Collector agents are often only intermediate nodes between producers and the actual long-term storage.
Data from producers is transferred over:
- HTTP protocol — push and pull models
- UDP protocol — push model
- Other kinds of RPC protocols like RMI in JVM — pull model
Data moving from collector agents to long-term storage can go over HTTP or UDP, or it can be buffered in some kind of queue, such as Apache Kafka or RabbitMQ.
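For example, a push over UDP can be as small as a fire-and-forget datagram per data point; a dropped packet simply means a missing data point. A minimal sketch (the address and the "name value timestamp" wire format are assumptions, not a standard):

```python
import socket
import time

def push_udp(name, value, host="localhost", port=8125):
    # Send one data point as a single datagram; no delivery guarantee.
    line = f"{name} {value} {int(time.time())}"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode(), (host, port))
    sock.close()

push_udp("cpu.utilization.total", 50)
```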
Query and analysis
Once data has accumulated in long-term storage (some systems use in-memory storage for hot data, so “long” means anything between hours and years), the next phase begins.
Users of the system can query, transform, and analyze the data to make decisions about the health, stability, and current and projected capacity of applications and the underlying infrastructure, as well as perform other kinds of analysis.
Another use case for the data is alerting. An alerting system defines queries, transformations, thresholds, and actions. The actions are executed when the results returned by a query exceed the thresholds; for example, an email can be sent when disk utilization on any host goes above 80%.
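A minimal sketch of such a rule, with query() and send_email() standing in for the real query engine and notification channel (both are hypothetical here):

```python
def check_disk_alert(query, send_email, threshold=80):
    # query() is assumed to return {host: latest disk utilization in percent}.
    for host, utilization in query("disk.utilization").items():
        if utilization > threshold:
            send_email(
                to="oncall@example.com",
                subject=f"Disk utilization on {host} is {utilization}%",
            )

# Example run against canned data and a print-based "email" action.
check_disk_alert(
    query=lambda name: {"hostA": 85, "hostB": 40},
    send_email=lambda to, subject: print("ALERT ->", to, subject),
)
```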
Conclusions
Metrics are the simplest and cheapest way to collect, transfer, and accumulate information about the internal state of applications and the underlying hardware.
Metrics can be pushed or pulled from producers.
Metrics should have information about producers. The information can be embedded into the names of the metrics or it can be stored as a dictionary of tags attached to the metrics.
In further posts in the “Observability 101” series I’ll dive deeper into the topics touched on in this post, as well as more topics related to monitoring and observability. Stay tuned.