
Prometheus Essentials: Introduction To Metric Types

While Prometheus is a great monitoring tool, it can feel scary at first glance, especially when you start looking at its query language, PromQL. Although the setup and implementation are generally straightforward, many users find working with PromQL to be the more challenging part.

In this article, we’ll take a different approach. Instead of starting with installation, we’ll begin by gaining a better understanding of what the tool provides.

Since metrics are what we actually want to collect, we’ll focus on the metric types Prometheus supports.

Prometheus Series

This post is part of a Prometheus series.

Prometheus Essentials: Introduction To Metric Types (THIS ARTICLE)
Prometheus Essentials: Install and Start Monitoring Your App
Prometheus Essentials: Node exporter And System Monitoring

Requirements

The entire series is written in a way that beginners can easily follow along. The goal is to explain concepts in a very simple manner. Although it is beginner-friendly, it doesn’t mean it is completely free of requirements. I anticipate the following prerequisites:

Coding: You should have some coding skills, preferably in Python. This is because we have a simple Flask app, but I won’t be covering how to create virtual environments, install modules, etc. I assume you already know that.

Containers: You should be familiar with basic container techniques. Nothing too advanced, but we won’t be covering the basics either.

Computer Savvy: We don’t cover what’s considered essential computer knowledge. You should already be comfortable with the basics 😂

Why would you use a tool like Prometheus?

In any monitoring system, metrics are the foundation for understanding the health and performance of your applications and infrastructure. Without metrics, it’s impossible to track system behavior, detect anomalies, or make informed decisions based on data. That's exactly what Prometheus does: it collects these metrics and lets you query them.

Where does Prometheus collect these metrics from?

Prometheus pulls metrics from a variety of sources that expose metrics in a format it can scrape.

Here are some examples of what Prometheus can collect:

  • Application Metrics: Prometheus can collect metrics directly from applications, such as request counts, error rates, or response times. For instance, web services often expose HTTP request durations or the number of requests per endpoint.
  • Infrastructure Metrics: Prometheus can gather system-level metrics from servers, such as CPU usage, memory consumption, disk I/O, and network bandwidth.
  • Database Metrics: Many databases, like MySQL or PostgreSQL, expose metrics such as the number of active connections, query execution times, and read/write statistics.
  • Coffee Machine Metrics: Some engineers have fun by instrumenting non-IT equipment. A Prometheus setup could scrape data from an IoT-enabled coffee machine, tracking things like brewing time, water temperature, and coffee consumption.
  • Custom Business Metrics: In addition to standard technical metrics, businesses often track custom metrics like the number of items processed in a queue, customer signups, or transactions per minute.

Oh my… are there any limits, you may ask? No, there are not. You can play around with anything as long as the metrics are being exposed.
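To make "exposed" a little more concrete: Prometheus scrapes a plain-text format in which each metric has a HELP line, a TYPE line, and one line per sample. The sketch below renders that layout by hand. The function name and the coffee machine metric are made up for illustration, and a real application would normally rely on a client library rather than building these strings itself.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric in the text layout Prometheus scrapes.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside curly braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical coffee machine metric, named for illustration only.
print(render_metric(
    "coffee_cups_brewed_total",
    "counter",
    "Total cups brewed since the machine started.",
    [({"kind": "espresso"}, 42), ({"kind": "latte"}, 17)],
))
```

Anything that can serve text like this over HTTP can be scraped, which is why the list of possible sources is so open-ended.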

Metric Types

Prometheus supports four primary metric types:

1. Counter: A cumulative metric that only increases over time, used to count things like the number of requests or errors.

2. Gauge: A metric that can go up and down, typically used for values like CPU usage, memory consumption, or temperature.

3. Histogram: A metric that samples observations (like request durations or sizes) and counts them in configurable buckets, providing a distribution.

4. Summary: Similar to a histogram but with a focus on quantile estimation (e.g., 90th percentile response times), and also provides a total sum and count of observations.

Counter Metric Type

A counter is a cumulative metric that only ever increases. It is ideal for tracking things that continuously grow, like the number of requests processed, tasks completed, or errors encountered. Counters can be reset to zero (e.g., on service restarts), but they never decrease.

Counter Metrics Real-World Example

Let’s consider a real-world example where you want to track the total number of HTTP requests your web application receives. You can set up a Prometheus Counter metric for this:

  • http_requests_total: This metric represents the total number of HTTP requests your server has handled since it started. For example, it might start at 0 and increase with each incoming request: 1, 2, 3, and so on.

Every time your web server receives a new request, this counter increments, giving you a cumulative view of how much traffic your server has handled.
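As a rough illustration of that behavior, here is a toy counter in plain Python. It is not the Prometheus client library, just a small stand-in (the class and handler names are assumptions) that shows the one rule that matters: the value can only go up.

```python
class ToyCounter:
    """A stand-in for a counter: starts at 0 and can only increase."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def inc(self, amount=1.0):
        # Rejecting negative amounts is what makes this a counter.
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


http_requests_total = ToyCounter("http_requests_total")


def handle_request():
    # Hypothetical request handler: count the request, then do the real work.
    http_requests_total.inc()


for _ in range(3):
    handle_request()

print(http_requests_total.value)  # 3.0
```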

Let’s visualize how a counter behaves over time:

Value
  |
7 |                  ______
  |                 /
5 |           _____/
  |          /
3 |     ____/
  |    /
1 |___/
  +--------------------------> Time

Explanation: The value of the counter increases over time, reflecting cumulative metrics like total HTTP requests. In this example, it starts at 1, then increments to 3, 5, and finally 7.

Why Use Counters?

Counters are perfect for metrics that should never decrease, like counting the number of successful tasks or the total number of errors encountered. For instance, in the case of an HTTP server, http_requests_total provides a clear view of the total requests processed. If you want to track how often an event occurs over time (like the number of API calls or database errors), a counter is the ideal choice.
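Because a counter only grows (apart from resets on restart), "how much happened in this period" can be worked out from a handful of scraped values. The sketch below is a simplified take on that idea: any drop between two samples is treated as a restart from zero. PromQL's own functions for this do more (for example, they extrapolate to the edges of the time window), so read this as an illustration of the reset handling only.

```python
def increase(samples):
    """Total growth across scraped counter values, ordered oldest first.

    A drop between two consecutive samples is treated as a reset to zero,
    so the new value itself counts as growth since the restart.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # the counter restarted from 0
    return total


print(increase([1, 3, 5, 7]))  # 6.0
print(increase([5, 7, 2, 4]))  # 6.0 (2 before the restart, 4 after it)
```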

Gauge Metric Type

The Gauge metric type in Prometheus is used to measure values that can fluctuate up or down. Gauges are ideal for tracking metrics like memory usage, CPU load, or the number of active connections, where the value changes dynamically over time.

❗Unlike counters, which only increase, gauges reflect the current state of a system and can both increase and decrease.

Fluctuating Values: Gauges track metrics that can rise or fall over time, such as temperature, disk usage, or number of active sessions.

Current State: A gauge represents the current value of a metric at the time it is scraped, providing a snapshot of the system.

No Accumulation: Gauges do not accumulate; instead, they represent the instantaneous value observed at each collection point.

Gauge Metrics Real-World Example

Let’s say you want to monitor the number of active users connected to your web application. You can use a Gauge metric in Prometheus for this:

  • active_users: This metric tracks the current number of users actively connected to your service. Since the number of users can fluctuate throughout the day, the gauge will reflect this ebb and flow.

For example, the number of active users could be 10 in the early morning, increase to 50 during peak hours, and then drop to 15 by the evening.
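The same day can be sketched with a toy gauge in plain Python. Again, this is only a stand-in with assumed names, not the Prometheus client library. The point is that a gauge can be set, increased, and decreased freely.

```python
class ToyGauge:
    """A stand-in for a gauge: holds the current value, which may go up or down."""

    def __init__(self, name):
        self.name = name
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


active_users = ToyGauge("active_users")
active_users.set(10)  # early morning
active_users.inc(40)  # peak hours: 50 users
active_users.dec(35)  # evening: back down to 15

print(active_users.value)  # 15.0
```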

Here’s how the gauge might look as the number of active users fluctuates:

Value
  |
  |       /\
  |      /  \     /\
  |     /    \___/  \________
  |    /
  |___/
  +---------------------------> Time

Explanation: The gauge value rises and falls over time, showing that the number of active users increases, peaks, and then decreases.

Why Use Gauges?

Gauges are essential when tracking metrics that go up and down. For instance, monitoring real-time CPU usage, memory consumption, or active user counts requires gauges because these values fluctuate. Gauges provide a real-time view of system behavior, helping you keep an eye on key performance indicators as they change.

Histogram Metric Type

The Histogram metric type in Prometheus is useful for tracking the distribution of values over time. It works by recording observations (such as request durations or response sizes) and placing them into pre-configured buckets that represent ranges of values. Along with these buckets, histograms also provide the total number of observations and the sum of all values.

❗Unlike the Summary metric type which we cover below, histograms do not estimate quantiles directly but offer detailed information on the distribution of observations across defined ranges.

Components:

  • Bucket: Stores the count of observations that are less than or equal to the bucket's upper bound (the le label), so the buckets are cumulative.
  • Sum: Total sum of all observed values.
  • Count: Total number of observations.

Histogram Metrics Real-World Example

Imagine you run an online store and want to track how long customers spend on the checkout page. You could set up a Prometheus Histogram metric for this purpose. Here’s how it might work:

  • checkout_duration_seconds_bucket{le="1"}: Tracks how many checkouts were completed in 1 second or less.
  • checkout_duration_seconds_bucket{le="2"}: Tracks how many checkouts took up to 2 seconds.
  • checkout_duration_seconds_sum: The total time all customers have spent on the checkout page.
  • checkout_duration_seconds_count: The total number of checkout events.

Let’s break this down with an example using pre-configured time buckets:

| 1s  | 2s  | 3s  | 5s  | 10s  |

If a customer finishes checkout in 2.5 seconds, that event will be counted in the 3s, 5s, and 10s buckets (since it’s less than or equal to those values), but not in the 1s or 2s buckets.
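Here is a small plain-Python sketch of that bookkeeping. It is a toy model with assumed names rather than the client library. It also adds a final infinite bucket, which Prometheus histograms always have (exposed as le="+Inf"), so that every observation lands somewhere.

```python
import math


class ToyHistogram:
    """A stand-in for a histogram with cumulative buckets, a sum, and a count."""

    def __init__(self, upper_bounds):
        # The last bucket is +Inf, so it always equals the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Cumulative: bump every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


checkout = ToyHistogram([1, 2, 3, 5, 10])
checkout.observe(2.5)

print(dict(zip(checkout.upper_bounds, checkout.bucket_counts)))
# {1: 0, 2: 0, 3: 1, 5: 1, 10: 1, inf: 1}
```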

+-------------+-------------+-------------+-------------+-------------+
| <= 1 sec    | <= 2 sec    | <= 3 sec    | <= 5 sec    | <= 10 sec   |
+-------------+-------------+-------------+-------------+-------------+
|   100       |   200       |   300       |   450       |   500       |
+-------------+-------------+-------------+-------------+-------------+

In this example, the table shows that:

  • 100 checkout processes took less than or equal to 1 second.
  • 200 checkout processes took less than or equal to 2 seconds.
  • 300 checkout processes took less than or equal to 3 seconds, and so on.

The buckets allow you to see how checkout durations are distributed, and from this, you can gain insights like how many customers are experiencing delays at checkout.
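Since the bucket counts are cumulative, a little subtraction turns the table above into per-range counts and into a "share of slow checkouts" figure. The sketch below uses the numbers from the table. The 3-second threshold for "slow" is an assumption chosen for illustration, and the total assumes no checkout took longer than 10 seconds.

```python
# Cumulative counts from the table: upper bound in seconds -> count.
cumulative = {1: 100, 2: 200, 3: 300, 5: 450, 10: 500}


def per_range_counts(cumulative):
    """Turn cumulative bucket counts into counts per (lower, upper] range."""
    result = {}
    previous_bound, previous_count = 0, 0
    for bound in sorted(cumulative):
        result[(previous_bound, bound)] = cumulative[bound] - previous_count
        previous_bound, previous_count = bound, cumulative[bound]
    return result


ranges = per_range_counts(cumulative)
print(ranges)
# {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 5): 150, (5, 10): 50}

total = cumulative[10]  # assumes nothing took longer than 10 seconds
slow_share = (total - cumulative[3]) / total
print(f"{slow_share:.0%} of checkouts took longer than 3 seconds")
# 40% of checkouts took longer than 3 seconds
```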

Why Use Histograms?

Histograms are particularly useful when you need to understand the distribution of observations within defined ranges. In the online store example, a histogram helps identify checkout time trends, such as how many customers experience slow checkouts, and can be used to monitor the performance over time or alert when too many customers experience slow processing.
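Although a histogram does not report quantiles itself, they can be estimated from the buckets at query time. In PromQL this is the job of the histogram_quantile function, which interpolates linearly inside the bucket where the requested rank falls. The sketch below is a simplified version of that idea, using the bucket counts from the online store example. It ignores details such as the +Inf bucket, and the function name is an assumption.

```python
def estimate_quantile(q, cumulative):
    """Estimate a quantile from cumulative bucket counts.

    `cumulative` maps each finite upper bound to its cumulative count.
    Observations are assumed to be spread evenly inside each bucket.
    """
    bounds = sorted(cumulative)
    total = cumulative[bounds[-1]]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for bound in bounds:
        count = cumulative[bound]
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return float(bound)
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (bound - lower_bound) * fraction
        lower_bound, lower_count = float(bound), count
    return float(bounds[-1])


cumulative = {1: 100, 2: 200, 3: 300, 5: 450, 10: 500}
print(round(estimate_quantile(0.5, cumulative), 2))   # 2.5
print(round(estimate_quantile(0.95, cumulative), 2))  # 7.5
```

The estimate is only as precise as the buckets allow: with a 5 to 10 second bucket, anything in that range is a guess between the two bounds.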

Summary Metric Type

The Summary metric type tracks observations and reports the total count of events, the sum of all observed values, and quantiles. A quantile is an estimate of the value below which a given percentage of the observations falls. Unlike the other metric types, Summaries report quantiles directly.

Think of it like timing how long people take to check out at a grocery store.

  • Sum: Total time all customers spent at the checkout.
  • Count: Total number of customers who went through the checkout.
  • Quantiles: How long it took for a certain percentage of customers to check out.

Components of a Summary Metric

  •  Quantiles: These estimate the time it takes for a percentage of customers to finish checkout. For example:
    • 50% (quantile 0.5) of the customers take less than 2 minutes to check out.
    • 99% (quantile 0.99) take less than 10 minutes.
  • Sum: The total time all customers combined spent at the checkout.
  • Count: The total number of customers who checked out.

Summary Metrics Real-World Example

Again, let’s say you run a small grocery store, and you want to track how long customers spend at the checkout. You could use a Summary metric for this.

Let's look at the following example:

  • checkout_duration_seconds_sum: The total amount of time all customers have spent at the checkout.
  • checkout_duration_seconds_count: The number of customers who have completed checkout.
  • checkout_duration_seconds{quantile="0.99"}: The duration within which 99% of customers finish checking out. This could be something like 10 minutes (reported as 600, because the metric is in seconds), meaning only 1% of customers take longer than 10 minutes.

These components—sum, count, and quantiles—are core features that define a Summary metric type in Prometheus.
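The three components can be sketched with a toy summary in plain Python. This is a deliberately naive model with assumed names: it keeps every observation and computes an exact nearest-rank quantile, whereas real implementations typically use streaming estimates so they do not have to store everything. The sample durations are invented for illustration.

```python
import math


class ToySummary:
    """A stand-in for a summary: sum, count, and quantiles of observations."""

    def __init__(self):
        self.observations = []

    def observe(self, value):
        self.observations.append(value)

    @property
    def sum(self):
        return sum(self.observations)

    @property
    def count(self):
        return len(self.observations)

    def quantile(self, q):
        """Nearest-rank quantile: smallest value with at least q of the data at or below it."""
        ordered = sorted(self.observations)
        if not ordered:
            return math.nan
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


# Invented checkout durations in seconds, for illustration only.
checkout_duration_seconds = ToySummary()
for seconds in [60, 90, 120, 150, 600]:
    checkout_duration_seconds.observe(seconds)

print(checkout_duration_seconds.count, checkout_duration_seconds.sum)  # 5 1020
print(checkout_duration_seconds.quantile(0.5))   # 120
print(checkout_duration_seconds.quantile(0.99))  # 600
```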

How Quantiles Work

If you’re tracking how long customers take to check out, you might track the 0.5, 0.9, 0.95, and 0.99 quantiles. The table below shows how checkout durations might be distributed:

+-----------+-----------+------------+------------+
|  50%      |  90%      |  95%       |  99%       |
| (0.5)     | (0.9)     | (0.95)     | (0.99)     |
+-----------+-----------+------------+------------+
|  2 min    |  5 min    |  7 min     |  10 min    |
+-----------+-----------+------------+------------+

  • 50% of your customers finish checkout in less than 2 minutes.
  • 99% of them finish within 10 minutes.

This is useful because you can easily see the typical checkout duration and spot any outliers who take much longer.

Test Your Understanding

  1. What are the four primary metric types supported by Prometheus?

  2. When should you use a Counter metric type, and why?

  3. How does a Gauge differ from a Counter in terms of value changes?

  4. What is the purpose of buckets in a Histogram metric type?

  5. How are Summaries different from Histograms in terms of what they provide?

  6. Can you give a real-world example of when to use each metric type?

  7. What information does a Summary metric provide in addition to quantiles?

Final Thoughts

Understanding Prometheus metric types is important for building good monitoring systems.

Each metric type—Counter, Gauge, Histogram, and Summary—offers unique ways to observe and analyze system behavior, from cumulative counts and fluctuating values to detailed distributions and quantile estimates.

By choosing the right metric type for the data you want to collect, you can gain deeper insights into the health and performance of your systems. Whether you're tracking request counts, measuring resource usage, or analyzing response times, knowing how these metrics work will help you make better decisions and respond effectively to any changes in your environment.

I hope this breakdown of Prometheus metrics helps you feel more confident in using the tool for monitoring your applications. Feel free to share your thoughts or ask questions if you need more clarity on any of these concepts!

See you in the next post, where we will write some code and start monitoring.

Happy Monitoring, space monkeys!
