o11y-weekly

2023-10-27 #3 Defining Good Observability Backend Trade-offs and QoS

This week in o11y: the third post, about trade-offs and QoS management on the observability backend side.

The previous post introduced high-level QoS and trade-offs. This post focuses on defining QoS and trade-offs for the observability backend side only.

Trade-offs and QoS

This post is an example of what good trade-offs and QoS could look like. Whether the purpose is to scale and integrate more teams or simply to tackle a legacy system, it is always a good idea to think about the data life cycle.

Starting from the two quotes of last week:

Quote:

“Simplicity is prerequisite for reliability” Edsger Dijkstra.

Reference: QoS in cloud computing

“Quality-of-Service (QoS) management, [..] is the problem of allocating resources to the application to guarantee a service level along dimensions such as performance, availability and reliability.”

For each signal, those properties are very important to know when using SaaS products (GrafanaCloud, DataDog, Splunk and so on), because they will impact the bill at the end of the month, or define the skills required from the Ops and SRE team responsible for building and maintaining the observability backends.

The motto is to keep it stupid simple: if the use case demands too much, it just means that an observability solution alone is not the right answer.

Non Functional Requirements (NFRs)

Using observability solutions to cover a risky business can be challenging, and those Non Functional Requirements should be defined from day 1 to avoid a complete SLA mess.

Is it a good thing to measure business SLAs on an observability stack which does not guarantee equal or higher SLAs?

How about finding reasonable trade-offs instead of supporting higher SLAs than required? Is it really necessary to support and pay for a complex observability stack with higher SLAs than needed?

According to Edsger Dijkstra, “Simplicity is prerequisite for reliability”, and this is why defining those NFRs matters: it reduces the complexity needed to support a high SLA.
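To make the SLA question above concrete, here is a back-of-the-envelope sketch. The 99.9% business SLA and 99.5% observability SLA are illustrative assumptions, not figures from this post.

```python
# Back-of-the-envelope sketch: monthly downtime budget per SLA, and the blind
# spot created when the observability stack offers a lower SLA than the
# business service it is supposed to measure. All figures are illustrative.
MINUTES_PER_MONTH = 30 * 24 * 60

def downtime_minutes(sla: float) -> float:
    return (1 - sla) * MINUTES_PER_MONTH

business_sla = 0.999        # assumed business target
observability_sla = 0.995   # assumed, cheaper observability stack

print(f"business downtime budget      : {downtime_minutes(business_sla):.0f} min/month")
print(f"observability downtime budget : {downtime_minutes(observability_sla):.0f} min/month")
# In the worst case the observability stack can be blind for ~216 min/month,
# several times the ~43 min budget it is supposed to measure.
```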

Creating a POC to analyze the output is a good idea, using an agent and a file output. With the opentelemetry collector contrib, matching receivers (prometheus / graphite / filelog / tracing receivers, whichever fit the application's telemetry) and a file exporter writing OTLP, it is really easy to estimate the required telemetry capacity, as sketched below.
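A minimal sketch of how such a POC output could be analyzed, assuming the collector's file exporter wrote one JSON-encoded OTLP payload per line to a hypothetical otlp-output.json during a known capture window; the file name and window length are illustrative.

```python
# Estimate telemetry volume from a collector file exporter capture.
# Assumptions (for illustration only): one JSON payload per line,
# file path ./otlp-output.json, 15-minute capture window.
import json
from pathlib import Path

CAPTURE_SECONDS = 15 * 60                 # hypothetical capture window
OUTPUT_FILE = Path("otlp-output.json")    # hypothetical file exporter path

total_bytes = 0
total_payloads = 0
with OUTPUT_FILE.open(encoding="utf-8") as f:
    for line in f:
        json.loads(line)                  # sanity check: each line is a JSON payload
        total_bytes += len(line.encode("utf-8"))
        total_payloads += 1

print(f"payloads/s : {total_payloads / CAPTURE_SECONDS:.2f}")
print(f"bytes/s    : {total_bytes / CAPTURE_SECONDS:.0f}")
print(f"MB/day     : {total_bytes / CAPTURE_SECONDS * 86400 / 1e6:.1f}")
```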

Combining fulltext search + at-least-once delivery + high precision at high rates + large documents to cover a high risk is simply too much!

If those properties are important and the risk is high, how about investing in a data platform (opentelemetry collector contrib kafka exporter, bigquery, tableau, grafana + bigquery…)?

Simply replacing fulltext search with a single dimension already reduces the complexity, as in the sketch below.
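A minimal sketch of that replacement, assuming a Prometheus-style metrics pipeline; prometheus_client and the metric/label names are illustrative choices, not prescribed by this post.

```python
# Trade fulltext search for one low-cardinality dimension: instead of emitting
# free-text log lines to be searched later, expose a single label on a counter.
from prometheus_client import Counter

checkout_errors = Counter(
    "checkout_errors_total",
    "Checkout failures by error type",
    ["error_type"],  # the single dimension replacing fulltext search
)

def handle_payment_failure(exc: Exception) -> None:
    # A bounded set of values ("TimeoutError", "CardDeclined", ...) instead of
    # grepping free-text messages after the fact.
    checkout_errors.labels(error_type=type(exc).__name__).inc()

handle_payment_failure(TimeoutError("payment gateway timed out"))
```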

At-least-once delivery does not fit with the collector and instrumentation/logging libraries. In case of failure, messages are buffered somewhere with finite capacity, and new messages end up being dropped while the errors last.
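A minimal sketch of that behaviour, assuming an in-process export buffer with finite capacity; the queue size and drop policy are illustrative, not those of any specific library.

```python
# Why typical instrumentation pipelines are not at-least-once: the in-process
# buffer is finite, and when the backend is unreachable the newest messages
# are simply dropped.
import queue

export_buffer: "queue.Queue[str]" = queue.Queue(maxsize=3)
dropped = 0

def emit(message: str) -> None:
    global dropped
    try:
        export_buffer.put_nowait(message)   # normal path: buffer for the exporter
    except queue.Full:
        dropped += 1                        # outage lasted too long: data is lost

for i in range(10):                         # simulate an outage: nothing is consumed
    emit(f"log line {i}")

print(f"buffered={export_buffer.qsize()} dropped={dropped}")  # buffered=3 dropped=7
```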

A message-oriented middleware should be put in place on the observability backend side, instead of relying on logging libraries, to support such properties. As an example, ZeroMQ's “high water mark” is a very well documented, broker-less mechanism which can support such a guarantee, but by blocking in the worst case. Combined with a queueing system, it becomes much easier to support those guarantees.
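A minimal sketch of the producer side, assuming pyzmq and a downstream PULL consumer; the address, the HWM value and the socket pattern are illustrative assumptions.

```python
# ZeroMQ high water mark as backpressure instead of silent dropping.
import zmq

ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.set_hwm(1000)                        # at most ~1000 messages queued per peer
sender.connect("tcp://127.0.0.1:5557")      # hypothetical downstream PULL consumer

def ship(payload: bytes) -> bool:
    try:
        # With NOBLOCK the call fails fast once the HWM is reached; a plain
        # blocking send() would apply backpressure to the caller instead.
        sender.send(payload, flags=zmq.NOBLOCK)
        return True
    except zmq.Again:
        return False                        # queue full: retry, spill to disk or block

ship(b'{"msg": "checkout failed", "error_type": "timeout"}')
```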

While integrating more complex datasources to support corner cases, it is still possible to use observability backends, or to rely on datasource APIs as in Grafana, to visualize datapoints.

Five whys

Make it simple

Conclusion

Keep the observability solution as simple as possible: define trade-offs which fit with observability tools, identify incompatibilities, and fix them by using appropriate tools.

Precision in telemetry is a hard topic, and monotonicity is a crucial one for the next post.