Start monitoring your service today 👨‍🔬

4 min read · Feb 23, 2023

If you are a software engineer at a tech company, you probably hear a lot about observability and monitoring, but most of us don’t know how to answer the most basic questions: what do we need to monitor, and how do we start monitoring our system?

But first… Definition of observability/monitoring

An observable software system provides the ability to understand any issue that arises. Conventionally, the three pillars of observability data are metrics, logs and traces.
https://www.dynatrace.com/monitoring/platform/observability
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs.
https://cloud.google.com/architecture/devops/devops-measurement-monitoring-and-observability

Ok… from this we understand that we want our systems to expose data about themselves, so that we and other engineers can understand what is happening inside them without knowing their internals.

And why?

Now that we know what monitoring is, we need to understand why we want this capability.

To convince you, I'll list some of the benefits I see:

Increase productivity while dealing with bugs
Understand service performance behavior over time
Find places for performance improvements (bottlenecks)
Prevent wasting resources
Help define SLAs (Service Level Agreements) and monitor them
Understand service throughput capacity
Define alarms

There might be a couple of other benefits that I didn't list here but, in order to achieve all this, we need to understand what we should monitor and how to start.

What should we monitor?

It depends ✨

Each service has its own capabilities and you need to understand them before creating any metric. But I won't leave you without any concrete examples 😅

There are some metrics that will benefit almost any kind of service you build, and here is a list of some of them.

HTTP communication

Assuming your service exposes HTTP endpoints and interacts with other services via HTTP, you should measure:

Requests received:
Count and latency by method, path and status-code
Requests sent:
Count and latency by method, path and status-code
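To make the list above a bit more concrete, here is a minimal sketch in plain Python of counting requests and recording latency by method, path and status code. The names `observe_http`, `request_count` and `request_latency_seconds` are made up for illustration, and the in-memory dictionaries only stand in for whatever metrics library you end up using.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stores, keyed by (direction, method, path, status).
request_count = defaultdict(int)
request_latency_seconds = defaultdict(list)


class Outcome:
    """Mutable holder so the caller can report the status code."""

    def __init__(self):
        self.status = 500  # assume failure until the caller says otherwise


@contextmanager
def observe_http(direction, method, path):
    """Record one request; direction is "received" or "sent"."""
    outcome = Outcome()
    start = time.perf_counter()
    try:
        yield outcome
    finally:
        key = (direction, method, path, outcome.status)
        request_count[key] += 1
        request_latency_seconds[key].append(time.perf_counter() - start)


# Usage: wrap a handler (received) or an outgoing call (sent).
with observe_http("received", "GET", "/users/{id}") as outcome:
    outcome.status = 200  # would come from the real response

with observe_http("sent", "POST", "/payments") as outcome:
    outcome.status = 503
```

Note that the path in this sketch is the route template (`/users/{id}`) and not the raw URL. That is a design choice of the example: it keeps the number of distinct label combinations small.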

Resource utilization

It’s always interesting to understand how many resources your service needs and how its usage behaves over time. This helps you identify, for example, memory leaks and unused resources.

Memory consumption
CPU consumption
CPU throttling
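As a rough illustration of where these three numbers can come from, the sketch below reads memory and CPU time for the current Python process from the standard library, and computes a throttling ratio from the text of a cgroup v2 `cpu.stat` file. It assumes a Linux container using cgroup v2; the sample values and the helper names `parse_cpu_stat` and `throttled_ratio` are made up. In practice your platform (Kubernetes, for example) will often export these metrics for you.

```python
import time
import tracemalloc


def parse_cpu_stat(text):
    """Parse the "key value" lines of a cgroup v2 cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of scheduler periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample; on a Linux container this text would typically come from
# /sys/fs/cgroup/cpu.stat (the path is an assumption about your environment).
SAMPLE_CPU_STAT = """usage_usec 8200000
user_usec 6100000
system_usec 2100000
nr_periods 400
nr_throttled 20
throttled_usec 350000
"""

cpu_stats = parse_cpu_stat(SAMPLE_CPU_STAT)
ratio = throttled_ratio(cpu_stats)  # 20 throttled periods out of 400

# Memory and CPU of the current Python process, from the standard library.
# tracemalloc only sees memory allocated by Python, not the whole process.
tracemalloc.start()
buffer = [0] * 100_000  # stand-in for real work
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()
cpu_seconds = time.process_time()  # user + system CPU time used so far
```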

Database utilization

One of the most common system bottlenecks is the database. There might be bad queries, missing indexes, pool size limits and many other issues. This will vary according to the DB you use, but some simple and valuable metrics are:

Query execution time (with query identification)
Pool size monitoring
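Query execution time "with query identification" can be sketched like this: every database call is wrapped in a timer and filed under a stable, human-chosen name. The example uses SQLite only because it ships with Python; the `timed_query` helper and the query names are hypothetical. Pool size numbers usually come from your driver or pool library and are not shown here.

```python
import sqlite3
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical store: query name -> list of durations in seconds.
query_seconds = defaultdict(list)


@contextmanager
def timed_query(name):
    """Time the enclosed database call and file it under a stable query name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        query_seconds[name].append(time.perf_counter() - start)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ana",), ("bob",)])

with timed_query("users.find_by_name"):
    rows = conn.execute("SELECT id FROM users WHERE name = ?", ("ana",)).fetchall()

with timed_query("users.count"):
    total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Naming queries by hand, instead of using the raw SQL text as the identifier, is one way to keep the resulting metric readable on a dashboard.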

Async processing

Assuming your service uses something like a message broker (Kafka, RabbitMQ), you could measure:

Count of messages not processed (lag) by topic
Count of errors while processing messages (dead letters) by topic
Message consumption throughput by topic
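For brokers that track offsets per partition, as Kafka does, lag by topic is just the distance between the newest offset and the last committed one, summed over the topic's partitions. Here is a small sketch of that arithmetic; the function name and the offsets are made up, and fetching the real offsets depends on the client library you use. On a queue-based broker such as RabbitMQ, the equivalent number would be the queue depth.

```python
from collections import defaultdict


def lag_by_topic(end_offsets, committed_offsets):
    """Sum, per topic, how far each partition's committed offset trails its end."""
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        committed = committed_offsets.get((topic, partition), 0)
        lag[topic] += max(end - committed, 0)
    return dict(lag)


# Made-up offsets keyed by (topic, partition).
end_offsets = {("orders", 0): 120, ("orders", 1): 80, ("emails", 0): 40}
committed_offsets = {("orders", 0): 100, ("orders", 1): 80, ("emails", 0): 10}

print(lag_by_topic(end_offsets, committed_offsets))
# {'orders': 20, 'emails': 30}
```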

Exceptions

You should also enable your service to provide qualitative data about exceptions and, for that, you can use logs. Some of the most important logs you should have are:

Error details on 5xx/4xx HTTP requests (received or sent)
Error details when processing a message

And how do we expose this data?

Now that you already have an idea of what you could expose, you need to understand how to instrument your application and how to visualize all this data.

Instrumenting

Quantitative: one of the most common ways of exposing counter- and histogram-like metrics is with Prometheus.

Using Prometheus is not that hard, but it’s a whole topic in itself and I’ll leave its details for future articles.
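Without going into Prometheus itself, it helps to know what a "histogram-like" metric stores. The sketch below is a hand-rolled stand-in, not the Prometheus client: it keeps cumulative buckets (how many observations were less than or equal to each upper bound, plus a final infinity bucket) together with a running sum and count, which is the general shape Prometheus histograms use. The bucket bounds are made up.

```python
import bisect
import math


class Histogram:
    """Cumulative-bucket histogram in the style Prometheus uses."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)  # per bucket, not cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        running, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            out.append((bound, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds (made up)
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)

print(latency.cumulative())
# [(0.1, 1), (0.5, 3), (1.0, 3), (inf, 4)]
```

A counter, by comparison, is just a single number that only goes up, like `latency.count` here.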

Qualitative: logs are simple and easy. To start, you can just define a structure for your logs and print them to STDOUT.

Things can get incredibly complicated here, but we’ll also cover this in another article.
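As a starting point only, "define a structure and print it to STDOUT" could look like the sketch below, which uses Python's standard `logging` module to emit one JSON object per line. The field names, the `context` convention for extra fields and the service name are all assumptions made for this example, not a standard.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: extra fields travel in a "context" dict.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-service")  # service name is made up
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Error details for a failed outgoing HTTP request.
try:
    raise TimeoutError("payment provider did not answer")
except TimeoutError:
    logger.error(
        "request failed",
        exc_info=True,
        extra={"context": {"method": "POST", "path": "/payments", "status": 503}},
    )
```

Keeping the formatter in one small class like this makes it easy to copy the same log structure to other services.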

PS: Instrument your systems in a way that can be easily replicated in other services. Don’t let it become overhead while developing new features.

Visualization

To create dashboards and alarms and make efficient use of all this data, you could use something like Grafana.

Also a whole topic in itself, and outside this article's scope. Sorry.

Conclusion

Give monitoring a try. It will certainly increase your team's productivity, your confidence in shipping new features, and your service's quality, efficiency and performance.

Appendix

Sorry for not going deeper into the details of each topic, but I don’t want to overfeed you with a bunch of implementation details here. That is not the goal of this article.