What Is Prometheus?
Prometheus is a metrics-based monitoring and alerting stack that is purpose-built for handling metrics generated by dynamic cloud environments such as Kubernetes. Other example targets for Prometheus to gather metrics from include web apps and API servers. As well as monitoring and alerting, Prometheus can perform arithmetic on any ingested time-series data, such as adding and multiplying series or averaging metrics over time.
Prometheus started life in 2012 at SoundCloud, created because the monitoring tools available at the time were not suited to the dynamic scheduling of containerised clusters. Prometheus was publicly released in 2015 and has been part of the CNCF (Cloud Native Computing Foundation) since 2016.
Prometheus is written in the Go programming language and remains fully open source. It even has its own dedicated query language, PromQL. At its core, Prometheus uses a multi-dimensional data model built around time-series data.
Prometheus is made up of three main components: the biggest of these is the time-series database itself, the second is the alerting engine, and the third and final component is a robust scraping engine.
Prometheus is well known for being a flexible system that is able to perform metrics collection, display rudimentary graphs (which are somewhat limited in comparison to Kibana dashboards) and run complex mathematical queries.
An example of a calculation that Prometheus allows you to perform is to track request latency as a histogram metric and then use a PromQL function to calculate the 99th percentile across all of your instances. As Grafana has native support for Prometheus, the results of this calculation can then be visualised in Grafana as part of a wider reporting dashboard.
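A sketch of such a query, assuming an illustrative histogram metric named http_request_duration_seconds (any instrumented histogram works the same way):

```promql
# 99th-percentile request latency over the last 5 minutes,
# aggregated across all instances by summing buckets per "le" boundary
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

The aggregation by "le" is what makes the percentile span every instance rather than being computed per instance.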
In the event that you have a service that does not allow you to add a Prometheus endpoint directly (such as a Linux virtual machine or MySQL), you can use an exporter which is run as a sidecar alongside the service that you wish to monitor.
Exporters act as a middleman between Prometheus and the service you wish to extract time-series metrics from: they collect data from the service and convert it into a format suitable for Prometheus to ingest.
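For example, a minimal prometheus.yml scrape job for the standard node_exporter (which exposes Linux host metrics on port 9100 by default) might look like this; the job name and target address are illustrative:

```yaml
scrape_configs:
  - job_name: "node"                   # illustrative job name
    scrape_interval: 15s
    static_configs:
      - targets: ["10.0.0.5:9100"]     # node_exporter's default port
```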
How Popular Is Prometheus?
To understand how popular Prometheus is we can conduct market analysis by using Ahrefs to gather more insights on how many people are looking for this solution each month.
Each month on average over 7,900 searches are conducted for the term "Prometheus software" with the top page appearing for this term attracting over 85,000 visitors every month. The most popular month for people searching for Prometheus software was September 2021 with over 3,000 searches being conducted within the United States alone.
The leading countries that express the most interest in Prometheus in descending order are the US, India, Germany, Spain, Canada and the United Kingdom.
In terms of who exactly uses Prometheus, some of the most notable companies we can look to include Walmart, Slack, Uber and Red Hat.
GitHub’s metrics also support the argument that Prometheus is a widely adopted monitoring tool, as the project currently has 6.9k forks and 41.3k stars on GitHub.
Finally, we can also review StackShare’s approval ratings, which show that Prometheus has broad approval on the site and is mentioned as part of the technology stacks of over 760 companies.
About Time Series Metrics
It is important to note that Prometheus tracks time-series data. Time-series data differs from other types of metrics in that each series has a stable identifier, so you can track the same metric over time and observe how it develops.
A Prometheus metric usually has the following attributes: a name, a set of labels, and a numeric value for that combination of labels.
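In Prometheus's text exposition format, one such time series looks like this (the metric name, labels and value below are illustrative):

```text
# one time series: metric name + label set, followed by its numeric value
http_requests_total{method="post", code="200"} 1027
```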
If you are looking to learn more about other time-series databases, it may be worth researching StatsD, Graphite, OpenTSDB, Ganglia or InfluxDB. Unlike Graphite or StatsD, though, Prometheus avoids using a hierarchical naming system.
Use Cases For Prometheus
Prometheus has many applicable use cases for monitoring and alerting but we also wanted to delve deeper to highlight several particular use cases that readers may have previously been unaware of.
Harnessing Prometheus can save an organisation money when it is used to discover whether you are reserving too much RAM for a Kubernetes cluster. Prometheus can ingest the metrics that reveal, for example, that a container is only using 10% of the RAM allocated to it. By surfacing the relevant time-series metrics in a reporting dashboard, Prometheus lets you pinpoint low memory utilisation quickly. This simple misconfiguration means the unused memory cannot be used by other services that need it more, and the incorrect reservation wastes money.
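A hedged sketch of such a query, assuming cAdvisor and kube-state-metrics are being scraped (exact metric names vary between versions):

```promql
# Ratio of actual memory usage to memory requested, per container;
# values well below 1 point at over-reserved RAM
sum by (namespace, pod, container) (container_memory_working_set_bytes)
  /
sum by (namespace, pod, container) (
  kube_pod_container_resource_requests{resource="memory"}
)
```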
Prometheus can also use histograms which it stores locally within buckets. Histograms sample observations and provide a sum of all observed values. They are most often used to measure the distribution of attributes such as response sizes and are important for tracking request latency distributions.
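A histogram is exposed as a set of cumulative buckets plus a running sum and count of observations; for an illustrative http_request_duration_seconds metric, the scraped samples look roughly like:

```text
http_request_duration_seconds_bucket{le="0.1"} 3243
http_request_duration_seconds_bucket{le="0.5"} 4100
http_request_duration_seconds_bucket{le="+Inf"} 4250
http_request_duration_seconds_sum 2396.4
http_request_duration_seconds_count 4250
```

Each bucket counts every observation at or below its "le" boundary, which is why the counts grow cumulatively towards +Inf.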
In the event that you wish to measure payload sizes, you will likely find summary metrics useful for calculating a sliding time window to provide a total count and a sum of all observed values.
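A summary exposes precomputed quantiles over a sliding window alongside the same _sum and _count pair; for an illustrative payload-size metric:

```text
payload_size_bytes{quantile="0.5"} 4096
payload_size_bytes{quantile="0.99"} 65536
payload_size_bytes_sum 12345678
payload_size_bytes_count 3021
```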
PromQL is noted by its users for being highly effective for time-series computations. The query language deliberately differs from SQL, as SQL-style languages tend to be obtuse when used to perform time-series operations.
Unfortunately, beginners using PromQL almost universally have a difficult time with vector matching (PromQL's equivalent of SQL joins). It is also notably hard to perform summing queries across metrics that have differing label sets.
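The difficulty typically shows up when two metrics carry different label sets: dividing or summing them requires aggregating both sides down to a shared set of labels and matching on it explicitly. A sketch with illustrative metric names:

```promql
# Error ratio where the two metrics only reliably share the
# "job" and "path" labels; both sides are aggregated to that
# shared label set before the one-to-one match
sum by (job, path) (rate(http_errors_total[5m]))
  / on (job, path)
sum by (job, path) (rate(http_requests_total[5m]))
```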
In addition to these complaints, many users note that identifying the worst offending metrics that are blowing up cardinality isn’t always easy to achieve either.
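One commonly used (if expensive) way to hunt for cardinality offenders is to count active series per metric name:

```promql
# Top 10 metric names by number of active time series;
# this matches every series in the database, so run it sparingly
topk(10, count by (__name__) ({__name__=~".+"}))
```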
It is also worth noting that a way to bulk garbage-collect unused metrics is needed: one organisation we saw on Twitter complained of storing over 100PB of excess metrics.
Users whose educational background lacks a good foundation in statistics may find that they cannot make the most of PromQL without further training.
To find out more about getting started with PromQL, review our dedicated PromQL cheat sheet to learn some of the most important commands you'll need when slicing through your time-series data and getting to grips with Prometheus in general.
When larger organisations wish to scale Prometheus so that it can ingest more metrics, add more measurements and build a global view for improved visibility, many challenges arise: surprisingly for a cloud-native service, Prometheus does not scale horizontally well.
Issues with Prometheus quickly start to become apparent when you deploy software across multiple disparate clusters or regions, especially if these projects are not directly connected.
Prometheus exporters will also need to be selected, configured and updated as you add new services that generate metrics to your existing technology stack.
Because Prometheus uses labels (as opposed to dot-separated metric names) to define its time series, one unforeseen consequence of the label-based model is a greatly increased risk of cardinality explosion: every new label value creates a new time series.
If managing your Kubernetes deployment in production is quickly growing beyond your team's capacity, you will also need to scale your monitoring and alerting stack. A hosted Prometheus solution may be your best bet for scaling the best features of the open-source project for compliant enterprise use.