Monitoring with Prometheus

Observability is crucial to any company’s infrastructure, big or small. Prometheus is a popular open-source, metrics-based monitoring system written in Go, initially developed at SoundCloud in 2012. I first came across Prometheus back in 2017, and it was refreshing after using Graphite with StatsD, another popular monitoring solution.

It was an interesting experience when, after changing jobs, I moved from a company that used Prometheus to one that used Graphite/StatsD for service monitoring. I still remember pitching Prometheus to my manager four years ago, and I was excited to get the green light after showcasing a proof of concept that used Prometheus with the Graphite Bridge. The bridge lets an application be instrumented with Prometheus while its metrics are pushed to Graphite via Carbon-Relay. This is great because teams can migrate to the new monitoring system incrementally, without a hard switch: both monitoring systems can run side by side, while the application only uses the Prometheus metrics registry for metrics collection.

How does it differ from Graphite & StatsD?

Graphite is a time-series database and visualisation tool, while StatsD is a simple network daemon, originally developed and open-sourced by Etsy, that aggregates and summarises metrics. Prometheus, by contrast, is an open-source monitoring and alerting toolkit designed for cloud-native environments such as Kubernetes. One of the main differences is that Prometheus is pull-based: it scrapes metrics over HTTP from endpoints exposed by the monitored services. Graphite/StatsD is push-based: applications send their metrics to the daemon.
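
To make the push/pull difference concrete, here is a minimal sketch using only the Python standard library. The metric names, the label, and the helper functions are made up for illustration; a real service would normally use a StatsD or Prometheus client library rather than formatting the payloads by hand.

```python
import socket
from typing import Dict, Optional


def statsd_counter_line(name: str, value: int = 1) -> str:
    """Format a counter increment in the StatsD line protocol: <name>:<value>|c"""
    return f"{name}:{value}|c"


def push_to_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Push model: the application itself sends the datagram (fire-and-forget UDP).

    8125 is the conventional StatsD port; the host is an assumption.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def prometheus_exposition(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Pull model: render the text a /metrics endpoint returns when it is scraped.

    Label values are not escaped here, which a real client library would do.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )
```

With StatsD the application calls push_to_statsd(statsd_counter_line("checkout.orders")) every time an order is placed. With Prometheus the application only keeps the current value in memory and returns prometheus_exposition("orders_total", "Total orders.", 3, {"service": "checkout"}) whenever the server comes to scrape it.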

Is Prometheus the best monitoring solution?

It depends on the individual use case. From a developer’s point of view, I have always enjoyed working with Prometheus when instrumenting code, and there are a few things that make it stand out from the competition.

  • Provides dynamic service discovery, while Graphite with StatsD requires each service to be configured individually.
  • PromQL, a query language that provides operations such as filtering, aggregation, and joining metrics, is a powerful tool for data visualisation and analysis.
  • Provides a built-in alerting mechanism allowing operators to define alerting rules based on metric thresholds, trends, or other conditions. Alerts are routed through Alertmanager to various integrations, e.g. Slack or PagerDuty.
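
To give a feel for what PromQL filtering and aggregation do, here is a conceptual sketch in plain Python. It roughly mirrors the query sum by (service) (http_requests_total{status="500"}). The metric name, labels, and sample values are hypothetical, and this is only an illustration of the idea, not how Prometheus evaluates queries internally.

```python
from collections import defaultdict

# Hypothetical instant samples: one (labels, value) pair per time series.
SAMPLES = [
    ({"__name__": "http_requests_total", "service": "api", "instance": "a", "status": "500"}, 4.0),
    ({"__name__": "http_requests_total", "service": "api", "instance": "a", "status": "200"}, 120.0),
    ({"__name__": "http_requests_total", "service": "web", "instance": "a", "status": "500"}, 1.0),
    ({"__name__": "http_requests_total", "service": "web", "instance": "b", "status": "500"}, 2.0),
]


def select(samples, name, **matchers):
    """Filtering: keep series whose metric name and labels equal the matchers."""
    return [
        (labels, value)
        for labels, value in samples
        if labels.get("__name__") == name
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


def sum_by(samples, label):
    """Aggregation: add up values, keeping only the given label as the grouping key."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


errors = select(SAMPLES, "http_requests_total", status="500")
errors_per_service = sum_by(errors, "service")  # {'api': 4.0, 'web': 3.0}
```

The same two steps, select by label and then aggregate over the labels you don't care about, cover a large share of everyday dashboard and alerting queries.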

What are the potential drawbacks?

  • Cardinality bombs - It’s easy to make this mistake when you are not careful about which values are passed to labels. Every unique combination of label values creates a new time series, so an unbounded label value, for example a user ID, leads to rapid growth in the number of time series stored in the database. This results in increased storage requirements, longer query times and reduced performance. It’s therefore critical to manage and monitor the cardinality of your metrics.
  • Storage - It uses a local on-disk storage model, where metrics are stored locally on the Prometheus server. This results in high storage requirements when monitoring many services or storing many metrics (especially when there is a cardinality bomb!). It can also increase storage costs and requires regular disk space maintenance.
  • Data retention - It’s built for real-time monitoring, meaning that data is kept for a short period (15 days by default) and it isn’t optimised for storing and querying historical data over extended periods. Even though Prometheus isn’t designed for long-term storage, it’s possible to use Thanos, an open-source project that extends Prometheus into a highly available solution with long-term storage capabilities. Thanos features include data compaction, deduplication, and downsampling to optimise storage and query performance.
  • Complexity - It’s pretty complex. You need to know what you are doing when writing PromQL queries, and there is a steep learning curve before you can use PromQL to its fullest potential.
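
To show how quickly a cardinality bomb goes off, and what it does to disk usage, here is a back-of-the-envelope sketch. The label names and counts are invented, and the disk estimate follows the rough rule of thumb from the Prometheus storage documentation (retention time x ingested samples per second x bytes per sample, at around 1-2 bytes per sample), so treat the numbers as an order-of-magnitude guess rather than a capacity plan.

```python
from math import prod


def max_series(distinct_values_per_label: dict) -> int:
    """Upper bound on time series for one metric: the product of the number
    of distinct values each label can take (a metric with no labels is 1 series)."""
    return prod(distinct_values_per_label.values())


def estimated_disk_bytes(
    series: int,
    scrape_interval_s: float,
    retention_s: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the 1-2 byte rule of thumb
) -> float:
    """Rough estimate: every series yields one sample per scrape for the whole retention period."""
    samples_per_series = retention_s / scrape_interval_s
    return samples_per_series * series * bytes_per_sample


# Hypothetical HTTP request counter with bounded labels.
bounded = {"method": 5, "status": 10, "endpoint": 20}
# The same metric after someone adds a user ID label.
bomb = {**bounded, "user_id": 100_000}

FIFTEEN_DAYS_S = 15 * 24 * 60 * 60

safe_series = max_series(bounded)  # 1000
bomb_series = max_series(bomb)     # 100000000
safe_disk = estimated_disk_bytes(safe_series, 15, FIFTEEN_DAYS_S)
bomb_disk = estimated_disk_bytes(bomb_series, 15, FIFTEEN_DAYS_S)
```

In practice not every combination of label values shows up, so the real series count is lower than this upper bound, but a single unbounded label is still enough to multiply the storage footprint by several orders of magnitude.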

Can it collect network metrics?

Yes. Prometheus can collect SNMP-based data using exporters, small dedicated apps that expose data from other systems as Prometheus metrics (there is an exporter for almost everything - Docker, Kubernetes, Redis, NGINX, Apache, etc.). Some exporters for network and hardware monitoring:

  • snmp_exporter - official exporter, providing support for SNMPv1, SNMPv2c, and SNMPv3 protocols. Configurable by defining SNMP queries to collect specific metrics from SNMP-enabled devices.
  • cisco_exporter - provides pre-defined metrics for monitoring various aspects of Cisco devices, such as interfaces, CPU usage, memory utilisation, etc.
  • junos_exporter - exposes metrics related to interfaces, routing protocols, chassis health, and other Juniper-specific metrics.
  • unifi_exporter - exposes metrics for Ubiquiti UniFi networking devices.
  • dell_exporter - exposes metrics for Dell PowerEdge servers.
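
Under the hood, an exporter is not much more than an HTTP endpoint that translates some other data source into the Prometheus text format. Here is a minimal sketch using only the Python standard library. The metric name, the interface counters, and the read_interface_octets function are hypothetical stand-ins for whatever a real exporter would query on the device (over SNMP, SSH, or an API), and real exporters are usually built on an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_interface_octets() -> dict:
    """Hypothetical stand-in for querying a device; returns octets received per interface."""
    return {"eth0": 1024, "eth1": 2048}


def render_metrics() -> str:
    """Translate the device data into the Prometheus text exposition format."""
    lines = [
        "# HELP demo_interface_in_octets_total Octets received per interface.",
        "# TYPE demo_interface_in_octets_total counter",
    ]
    for ifname, octets in sorted(read_interface_octets().items()):
        lines.append(f'demo_interface_in_octets_total{{interface="{ifname}"}} {octets}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real exporter would log properly


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes. The port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a scrape target is enough for Prometheus to start collecting demo_interface_in_octets_total, which is why there are so many exporters around: the contract between Prometheus and the exporter is just this plain-text page.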

Prometheus is a modern real-time monitoring system that works very well with Kubernetes and offers features that set it apart from other solutions, such as PromQL for querying and Alertmanager for alerting. I have been using Prometheus professionally since 2017, and for fun I also monitor apps running on my Raspberry Pi Kubernetes cluster! It is a reliable and effective solution for monitoring and observing modern services in today’s complex environments. Although it comes with a steep learning curve, it’s well worth the effort.