Solving Enterprise Monitoring Issues with Prometheus


Chicago-based ShuttleCloud helps developers import user contacts and email data into their applications through standard API requests. As the venture-backed startup began to acquire more customers, it needed a way to scale system monitoring to meet the terms of its service-level agreements (SLAs). The team turned to Prometheus, the open source systems monitoring and alerting toolkit originally built at SoundCloud, which is now a project at the Cloud Native Computing Foundation.

In advance of Prometheus Day, to be held Nov. 8-9 in Seattle, we talked to Ignacio Carretero, a ShuttleCloud software engineer, about why they chose Prometheus as their monitoring tool and what advice they would give to other small businesses seeking a similar solution.

Ignacio P. Carretero, ShuttleCloud

Why did your enterprise start using a monitoring solution like Prometheus?

Ignacio Carretero: It started when our number of projects increased and new clients' SLAs became more demanding. We had some systems in place to monitor operational metrics (the status of our instances), business metrics (how we were performing) and whether our external front ends were up and running. However, we did not have a centralized monitoring system or a standard alerting system. In addition, some business metrics had to be reviewed manually. All of these problems have been solved with a monitoring solution like Prometheus.

Some of the reasons we chose Prometheus over other monitoring systems are the flexibility of its metric system (it doesn't have to be fixed from the beginning), its independence from external services (such as message buses or databases) and the simplicity of its installation and execution, as it is a single Go binary.

What are the most important things for small businesses to know when bringing on an in-house monitoring stack?

Ignacio: The most important thing we would mention is that having a Prometheus-based in-house monitoring solution does not have to be expensive. It is possible to start monitoring a complete infrastructure with only one instance and not a lot of development or setup time. Apart from that, it is good to know that monitoring is not a goal but a journey, and we must confess that this has been a pleasant one. Throughout this journey you'll fine-tune alerts and progress through the stages of getting your infrastructure monitored. In the specific case of Prometheus, we are also very satisfied with the available exporters. Most of them can be integrated without investing a lot of time, which is always important for small businesses.

What is the journey like to equip your infrastructure with monitoring technology? Is the process different for small businesses?

Ignacio: The main difference is that we do not have a specialized team that can take care of that process, so the whole team has to be involved. Every single developer in our engineering team collaborates on what is monitored by Prometheus and what remains monitored by the legacy systems. We all resolve the alerts that are triggered, so we can all participate in tuning thresholds, adding missing alerts and removing unnecessary ones.

What lessons did you learn while deploying a monitoring technology?

Ignacio: At the beginning we were very constrained in the time we could dedicate to the implementation, so we decided to start small. Therefore, we recycled some of the systems that we already had in place. To do so, we made some decisions that went against the design patterns of Prometheus. It might not have been the ideal design, but at least we had a starting point. From there, we iterated and improved our system as we began to understand what we were doing wrong and what could be improved. If we had waited until we had designed a perfect system, more than likely we would still have our old service in place.

What are the major benefits your environment has seen from using Prometheus?

Ignacio: The most important thing for us, the developers, is that we now totally trust our monitoring system. Before, we were constantly checking whether everything was all right and whether there was any issue. We now know that if any of the thresholds is reached, someone will be paged or an email will be sent, depending on the urgency of the issue.
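As an illustration of the "page or email, depending on urgency" behavior described above, here is a minimal sketch in Python. It is not ShuttleCloud's setup: in a Prometheus deployment this kind of routing is normally handled by Alertmanager based on alert labels. The `Alert` class, the `severity` values and the channel names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """A hypothetical alert: a measured value compared against a threshold."""
    name: str
    value: float
    threshold: float
    severity: str  # assumed labels: "critical" or "warning"


def route(alert: Alert) -> Optional[str]:
    """Return the notification channel, or None if the threshold is not reached."""
    if alert.value < alert.threshold:
        return None  # nothing fires, nobody is notified
    # Urgent problems page a person; everything else goes to email.
    return "pager" if alert.severity == "critical" else "email"
```

With these assumptions, a critical alert above its threshold returns `"pager"`, a warning above its threshold returns `"email"`, and any alert below its threshold returns `None`.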

Another major benefit is that the system is fairly easy to maintain. We still make improvements and fine-tune it, but the overall maintenance overhead has been kept to a minimum, even as we continue growing.

Finally, we would like to point out that PromQL (the Prometheus query language) is really useful and logical. The learning curve is definitely worth the effort. PromQL is also used for chart creation in Grafana, which is very easy to integrate with Prometheus.
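To give a flavor of PromQL, the sketch below builds an instant query against the Prometheus HTTP API (`/api/v1/query`) and parses a response of the documented instant-vector shape. The expression `up == 0` selects scrape targets that are currently down. The server address, the helper names and the sample response are assumptions made for illustration, and no network call is performed.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict:
    """Map each sample's 'instance' label to its value from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hypothetical server address; 9090 is the default Prometheus port.
url = build_query_url("http://localhost:9090", "up == 0")

# Hand-written sample response in the API's instant-vector format.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "instance": "db1:9100", "job": "node"},
                "value": [1478563200.0, "0"],
            }
        ],
    },
})
down_targets = parse_instant_vector(sample_body)
```

The same expression could be pasted into a Grafana panel backed by a Prometheus data source, which is how the charts mentioned above are typically defined.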
