Start monitoring your service today 👨‍🔬
If you are a software engineer at a tech company, you probably hear a lot about observability and monitoring, but most of us don’t know how to answer the most basic questions: what do we need to monitor? How do we start monitoring our system?
But first… Definition of observability/monitoring
An observable software system provides the ability to understand any issue that arises. Conventionally, the three pillars of observability data are metrics, logs and traces.
https://www.dynatrace.com/monitoring/platform/observability
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
https://cloud.google.com/architecture/devops/devops-measurement-monitoring-and-observability
Ok… from this we understand that what we want is to give our systems the capability of exposing data about themselves, so that we and other engineers can understand what is happening in a system without knowing its internals.
And why?
Now that we know what monitoring is, we need to understand why we want this capability.
To convince you, I'll list some of the benefits I see:
- Increase productivity while dealing with bugs
- Understand service performance behavior over time
- Find places for performance improvements (bottlenecks)
- Prevent wasting resources
- Help define SLAs (Service Level Agreements) and monitor them
- Understand service throughput capacity
- Define alarms
There might be a couple of other benefits that I didn't list here but, in order to achieve all this, we need to understand what we should monitor and how to start.
What should we monitor?
It depends ✨
Each service has its own capabilities and you need to understand them before creating any metric. But I won't leave you without any concrete examples 😅
There are some metrics that will benefit almost any kind of service you build, and here is a list of some of them.
HTTP communication
Assuming your service exposes HTTP endpoints and interacts with other services via HTTP, you should measure:
- Requests received: count and latency by method, path and status code
- Requests sent: count and latency by method, path and status code
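To make this more concrete, here is a minimal sketch of what recording these two metrics could look like, using only the Python standard library. The names `HttpMetrics`, `timed_call` and the bucket boundaries are my own assumptions for illustration, not part of any framework. In a real service a metrics library would usually do this bookkeeping for you.

```python
import time
from collections import defaultdict

# Upper bounds (in seconds) for the latency buckets; these values are assumptions.
LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


class HttpMetrics:
    """In-memory request counters and latency histograms keyed by labels."""

    def __init__(self):
        self.count = defaultdict(int)
        self.latency_sum = defaultdict(float)
        self.buckets = defaultdict(lambda: [0] * len(LATENCY_BUCKETS))

    def observe(self, direction, method, path, status, seconds):
        # direction is "received" or "sent"
        key = (direction, method, path, status)
        self.count[key] += 1
        self.latency_sum[key] += seconds
        # Increment the first bucket whose upper bound fits the observation.
        for i, bound in enumerate(LATENCY_BUCKETS):
            if seconds <= bound:
                self.buckets[key][i] += 1
                break


def timed_call(metrics, direction, method, path, handler):
    """Run handler(), which returns a status code, and record the outcome."""
    start = time.perf_counter()
    status = 500  # recorded if the handler raises instead of returning
    try:
        status = handler()
        return status
    finally:
        elapsed = time.perf_counter() - start
        metrics.observe(direction, method, path, status, elapsed)
```

One design note: it is usually safer to label by route template (for example `/users/{id}`) than by raw URL, otherwise every distinct id becomes its own time series.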
Resources utilization
It’s always interesting to understand how many resources your service needs and how that changes over time. This helps you identify, for example, memory leaks and unused resources.
- Memory consumption
- CPU consumption
- CPU throttling
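As a rough illustration, the sketch below samples these three numbers from inside a Python process. It assumes a Unix system (the `resource` module is not available on Windows) and, for throttling, a container running under cgroup v2 with its `cpu.stat` file at `/sys/fs/cgroup/cpu.stat`. That path and the `snapshot` helper are assumptions for the example. In many setups the platform (for example the container orchestrator) already collects these for you.

```python
import os
import resource  # Unix-only standard library module
import time


def parse_cpu_stat(text):
    """Parse 'key value' lines such as those found in a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of scheduler periods in which the group was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def snapshot(cpu_stat_path="/sys/fs/cgroup/cpu.stat"):
    """Collect one sample of CPU time, peak memory and (if available) throttling."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    sample = {
        "cpu_seconds": time.process_time(),  # user + system CPU time of this process
        "peak_rss": usage.ru_maxrss,  # kilobytes on Linux, bytes on macOS
    }
    if os.path.exists(cpu_stat_path):
        with open(cpu_stat_path) as f:
            sample["throttled_ratio"] = throttled_ratio(parse_cpu_stat(f.read()))
    return sample
```

Calling `snapshot()` periodically and plotting the values over time is what reveals trends such as a slowly growing memory footprint.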
Database utilization
One of the most common system bottlenecks is the database. There might be bad queries, missing indexes, pool size limits and many other issues. This will vary according to the DB you use, but some simple and valuable metrics are:
- Query execution time (with query identification)
- Pool size monitoring
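For the first item, a small sketch of timing queries under a stable name could look like the following. It uses an in-memory SQLite database only so the example runs anywhere. The `timed_query` helper, the `users` table and the query name are hypothetical, and pool monitoring is not covered here because it depends on the driver you use.

```python
import sqlite3
import time
from collections import defaultdict
from contextlib import contextmanager

# query name -> list of execution times in seconds
query_times = defaultdict(list)


@contextmanager
def timed_query(name):
    """Record how long the wrapped block takes under a stable query name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        query_times[name].append(time.perf_counter() - start)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("ada",))

with timed_query("users_by_name"):
    rows = conn.execute("SELECT id FROM users WHERE name = ?", ("ada",)).fetchall()
```

Naming the query instead of using the raw SQL text as the identifier keeps the number of distinct series small and keeps parameter values out of your metrics.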
Async process
Assuming your service uses something like a message broker (Kafka, RabbitMQ), you could measure:
- Count of messages not processed (lag) by topic
- Count of errors while processing messages (dead letters) by topic
- Message consumption throughput by topic
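To illustrate the first item, lag in a Kafka-style log can be computed as the difference between the last written offset and the last committed offset of the consumer. The sketch below does only that arithmetic on plain dictionaries. How you fetch the offsets depends on your client library, and `lag_by_topic` is a name I made up for the example. For a queue-based broker such as RabbitMQ, the queue depth plays the same role.

```python
def lag_by_topic(end_offsets, committed_offsets):
    """Sum, per topic, the messages written but not yet processed.

    Both arguments map (topic, partition) to an offset.
    """
    lag = {}
    for (topic, partition), end in end_offsets.items():
        # A partition with no committed offset is treated as fully unprocessed.
        committed = committed_offsets.get((topic, partition), 0)
        lag[topic] = lag.get(topic, 0) + max(end - committed, 0)
    return lag


end = {("orders", 0): 120, ("orders", 1): 80, ("emails", 0): 10}
committed = {("orders", 0): 100, ("orders", 1): 80}
print(lag_by_topic(end, committed))
```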
Exceptions
You should also enable your service to provide qualitative data about exceptions and, for that, you can use logs (Logging definition). Some of the most important logs you should have are:
- Error details on 5xx/4xx HTTP requests (received or sent)
- Error details when processing a message
And how do we expose this data?
Now that you have an idea of what you could expose, you need to understand how to instrument your application and how to visualize all this data.
Instrumenting
Quantitative: the most common way of exposing counter- and histogram-like metrics is using Prometheus.
Using Prometheus is not that hard, but it’s a whole topic in itself and I’ll leave its details for future articles.
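Just to give a feeling for what Prometheus collects, here is a hand-rolled sketch that renders a counter in the Prometheus text exposition format, which is the plain-text payload a scrape of your metrics endpoint returns. The function `render_counter` and the metric name are assumptions for illustration. In practice a Prometheus client library would generate this for you, and this sketch does not escape special characters in label values.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


samples = {
    (("method", "GET"), ("path", "/users/{id}"), ("status", "200")): 3,
    (("method", "GET"), ("path", "/users/{id}"), ("status", "500")): 1,
}
print(render_counter("http_requests_total", "Requests received.", samples))
```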
Qualitative: logs are simple and easy. To start, you can just define a structure for your logs and print them to STDOUT.
Things can get incredibly complicated here, but we’ll also cover this in another article.
PS: Instrument your systems in a way that can be easily replicated in other services. Don’t let it become an overhead while developing new features.
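As a starting point, a structured log can be as simple as one JSON object per line on STDOUT. The sketch below does this with Python's standard `logging` module. The field names, the `context` convention passed through `extra`, and the `orders-service` logger name are assumptions of mine, so adapt them to whatever structure your team agrees on.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields passed via extra={"context": {...}} (a convention assumed here).
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "upstream request failed",
    extra={"context": {"method": "GET", "path": "/users/{id}", "status": 502}},
)
```

Keeping the same field names across services is what makes these logs easy to search and aggregate later.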
Visualization
To create dashboards and alarms, and to make efficient use of all this data, you could use something like Grafana.
That is also a whole topic in itself and not in this article's scope. Sorry.
Conclusion
Give monitoring a try. It will certainly increase your team's productivity, your confidence in shipping new features, and your service's quality, efficiency and performance.
Appendix
Sorry for not going deeper into each topic, but I don’t want to overload you with implementation details here. That is not the goal of this article.