o11y-weekly

2023-12-07 #8 Meet Graphite

Graphite was created at Orbitz (2006), an online travel company, to monitor and support their growth.

Other time series databases (TSDBs), like RRDtool, already existed at the time.

Why did Orbitz decide not to use RRDtool + Cacti and create Graphite instead?

Is Graphite still worthwhile compared to other new and existing solutions?

Why Graphite ?

Reference: https://graphite.readthedocs.io/en/latest/faq.html#does-graphite-use-rrdtool

A problem with RRDtool is that it does not really support temporary absence of data (null/nil/None) and stores zero (0) instead, which is a good default for some usages but not for all of them.

How do you calculate throughput? If the latency drops to 0, is the throughput infinite?

For this null-latency use case, substituting 0 is not a good trade-off at all, and this is the first reason why the Orbitz team decided to create Graphite (before 2006).

What is Graphite ?

References:

Usually, TSDB backends are used for non-functional requirements like requests per second, …

According to the use case below, Graphite can also be suitable for measuring business values:

“For example, Graphite would be good at graphing stock prices because they are numbers that change over time.”

The Prometheus comparison highlights such a use case:

“Prometheus offers a richer data model and query language, in addition to being easier to run and integrate into your environment. If you want a clustered solution that can hold historical data long term, Graphite may be a better choice.”

The monotonicity and temporality post illustrates this fact and its trade-offs.

Graphite is not a monolith: multiple components compose it, such as carbon, whisper and graphite-web.

Quickstart

Reference: https://graphite.readthedocs.io/en/latest/install.html#docker

Using Graphite with Docker is the easiest way to test it quickly.

The Docker image is not production ready, though: many components are installed by default to make it easy to use for development, not for production.
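
As a quick sketch based on the referenced installation page (the image name and port mappings come from that page at the time of writing and may change), the all-in-one image can be started like this:

docker run -d \
 --name graphite \
 --restart=always \
 -p 80:80 \
 -p 2003-2004:2003-2004 \
 -p 2023-2024:2023-2024 \
 -p 8125:8125/udp \
 -p 8126:8126 \
 graphiteapp/graphite-statsd

Port 80 serves graphite-web, 2003-2004 the carbon receivers, 2023-2024 the carbon aggregator, and 8125/8126 the bundled StatsD.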

A demo with other backends is available in a previous post: demo.

Architecture and Scalability

Since the project has a long history (2006), Graphite has been forked and updated over time to support scalability at different scopes.

Projects:

An excellent (old) post from Teads explains how to scale Graphite: https://medium.com/teads-engineering/scaling-graphite-in-a-cloud-environment-6a92fb495e5

Graphite can be viewed as a backend or as a protocol, and other backends such as Prometheus, Mimir and VictoriaMetrics are compatible with it, but with a different aggregation temporality, which can conflict with the main feature of Graphite (long-lived cumulative counters).

⚠️ Not all backends are fully compliant with long-lived counters; if this feature matters, it is important to scale the data storage first, or any other Graphite component, like the Go Graphite project does.

whisper

Reference: https://github.com/graphite-project/whisper

Differences with RRD: https://graphite.readthedocs.io/en/latest/whisper.html#differences-between-whisper-and-rrd

Whisper is the default TSDB shipped with Graphite. Graphite can support many more TSDBs with different trade-offs (ClickHouse, InfluxDB, …).
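
Each metric path maps to one .wsp file on disk. As a hedged example, the whisper project ships small inspection scripts; the storage path below assumes a default /opt/graphite layout and the diceroll metric sent later in this post:

whisper-info.py /opt/graphite/storage/whisper/local/random/diceroll.wsp
whisper-fetch.py /opt/graphite/storage/whisper/local/random/diceroll.wsp

The first command prints the configured archives (retention and precision), the second dumps stored datapoints.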

carbon

References:

Carbon is the write path of the metrics signal. It serves different purposes: carbon-cache receives metrics and buffers them before writing them to Whisper, carbon-relay shards and replicates them across backends, and carbon-aggregator aggregates them before storage.

graphite-web

Reference: https://github.com/graphite-project/graphite-web

As opposed to carbon, graphite-web is responsible for the metric read path. This component serves the API and graph visualization.

Usually, only the API part of graphite-web is used, in conjunction with a frontend like Grafana.
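
For example, the render API can return raw datapoints instead of a rendered graph; a minimal sketch, reusing the hypothetical graphite.your.org host and diceroll metric from the protocol examples below:

curl "http://graphite.your.org/render?target=local.random.diceroll&from=-1h&format=json"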

Protocol

Reference: https://graphite.readthedocs.io/en/latest/feeding-carbon.html

Carbon supports many protocols but the most used is the straightforward plain text protocol.

Plain Text

<metric path> <metric value> <metric timestamp>

PORT=2003
SERVER=graphite.your.org
echo "local.random.diceroll 4 `date +%s`" | nc ${SERVER} ${PORT}

Labels

Reference: https://graphite.readthedocs.io/en/latest/tags.html

Depending on the backend configuration, the <metric path> can contain tags (aka labels).

my.series;tag1=value1;tag2=value2
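
Assuming tag support is enabled on the backend, the plain text one-liner shown above can carry tags directly in the metric path (the env and host tags here are illustrative):

PORT=2003
SERVER=graphite.your.org
echo "local.random.diceroll;env=dev;host=srv1 4 `date +%s`" | nc ${SERVER} ${PORT}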

StatsD

Reference: https://www.etsy.com/codeascraft/measure-anything-measure-everything/

StatsD was created by Etsy to send metrics without performance overhead or impact on SLAs when the metrics backend is down. By simply sending metrics to StatsD over UDP, the observed application is no longer responsible for managing state and is decoupled from the metrics backend, which is good when SLAs differ. StatsD also reduces the rate and sends data at a given resolution (e.g. 10s).

The protocol is not the same as Graphite's, but it is simpler and still plain text: <metricname>:<value>|<type>

echo "foo:1|c" | nc -u -w0 127.0.0.1 8125

A demo is available in a previous post: graphite + statsd vs other backends, with this StatsD UDP configuration.

Archiving old data

Reference: https://graphite.readthedocs.io/en/latest/whisper.html#archives-retention-and-precision

Optimizing space over time is crucial. Data can simply be deleted or compressed. Compression can be lossless or lossy and, depending on the use case, supporting both can be a good idea.

It is possible to set up lossy compression by increasing the resolution period of datapoints. A datapoint can be kept at a resolution of 10s for the last 3 months, then at 1 minute, which reduces space by a factor of 6 (60s / 10s).
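
With Whisper this is configured through retention archives in storage-schemas.conf; a sketch matching the example above (the [default] section name and the one-year length of the second archive are assumptions for illustration):

[default]
pattern = .*
retentions = 10s:90d,1m:1y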

Telemetry temporality

As mentioned in the OpenTelemetry metrics temporality and Monotonicity demo posts, Graphite is a delta metrics temporality backend which supports long-lived cumulative counters.

Query Language

The query language is really simple and composed of functions and series lists.

A series list is a metric with multiple hierarchical dimensions. Counting per environment and hostname can be structured like env.hostname.counter:

prod.srv1.requests

While counting for a specific env/hostname (prod.srv1.requests) does not impact Graphite performance, counting all prod requests with a wildcard (prod.*.requests) works but can impact performance when multiple dimensions are combined (*.*.requests) or simply when the number of servers is high (high cardinality issue).

Counting all requests can become complex as soon as the number of dimensions increases.
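
For illustration, two hedged queries using standard Graphite functions on the paths above:

sumSeries(prod.*.requests)
groupByNode(*.*.requests, 0, 'sum')

The first sums requests across all prod servers; the second groups every env.hostname.requests series by its first node (the environment) and sums per environment.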

Another issue is that adding labels afterward impacts queries, since one level of depth is added per new dimension.

⚠️ Adding the application name, going from env.hostname.counter to env.hostname.application.counter, changes the path of the metrics and all related queries are impacted.

Dashboards and queries should take both the previous and the new hierarchy into account to avoid losing the metric history.
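
One hedged way to bridge such a change is to query both depths explicitly, for example:

sumSeries(prod.srv1.requests, prod.srv1.*.requests)

so that the old env.hostname.counter series and the new env.hostname.application.counter series are combined into a single target.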

Graphite works best when metrics have few dimensions, low cardinality, and a carefully created hierarchy.

Additional tools

Grafana

Reference: https://grafana.com/docs/grafana/latest/datasources/graphite/

The name Grafana comes from the concatenation of two words, Graphite and Kibana; it was created to make Graphite visualization as smooth as possible.

The main difference between Grafana and other competitors is its support for cached datasources without requiring a full synchronization, which would impact resources and costs.

Grafana offers the best integration for Graphite since it was originally created for it.

According to the OTLP and Prometheus Grafana integrations, Grafana metrics backends like Mimir support only cumulative metrics, while Graphite is a true delta metrics backend. A dedicated post compares the pros and cons of delta and cumulative temporality.

A dedicated post will be created later for Grafana.

Datadog

Reference: https://www.datadoghq.com/blog/dogstatsd-mapper/

Datadog has a centralized model where all the data should be stored inside its database, which is a bit different from Grafana, since with Grafana you can choose, via a collector, to sync or to fetch and cache data.

Datadog is a drop-in solution for Graphite but seems to support delta temporality.

A dedicated post will be created later for Datadog.

Backends comparison

Graphite vs VictoriaMetrics vs Prometheus vs Mimir demo from a previous post

Conclusion

Graphite and all the middleware/TSDBs around it (StatsD, ClickHouse, …) have changed significantly to support labels and scalability. In the meantime, Prometheus won the battle for observability and rate monitoring, while delta modes and other use cases are not fully covered by those alternatives.

As mentioned by the Prometheus team, Graphite is best at supporting long-lived cumulative counters with few labels.

As soon as scalability becomes important for metrics, labels and pure observability, other solutions should be considered.