Span Metrics

Span metrics are aggregated from the spans your application emits and provide insight into its performance. On our platform, the Grafana Tempo metrics generator produces these metrics and exports them to Prometheus. They let you monitor request rates, error rates, latency, and span sizes for instrumented services.

The following metrics are available:

  • traces_spanmetrics_calls_total: Counter for the total number of spans (requests) processed, labeled by service, operation, status code, and more.
  • traces_spanmetrics_latency_bucket: Histogram bucket for span latency (duration), useful for analyzing latency distribution and setting SLOs.
  • traces_spanmetrics_latency_count: Total count of observed span latency measurements.
  • traces_spanmetrics_latency_sum: Sum of all span latencies, used to calculate average latency.
  • traces_spanmetrics_size_total: Counter for the total size in bytes of all observed spans, useful for tracking how much trace data a service produces.
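
As a minimal sketch of how the sum, count, and size series combine, the two queries below compute average latency and average span size over a 5-minute window. The service name my-app is a placeholder, and the latency unit is whatever the metrics generator records (seconds in a default Tempo setup, which is an assumption worth checking against your bucket boundaries).

```promql
# Average span latency: rate of summed latencies divided by rate of observations.
sum(rate(traces_spanmetrics_latency_sum{service_name="my-app"}[5m]))
/
sum(rate(traces_spanmetrics_latency_count{service_name="my-app"}[5m]))

# Average span size in bytes: bytes observed per second divided by spans per second.
sum(rate(traces_spanmetrics_size_total{service_name="my-app"}[5m]))
/
sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
```

Run each query on its own. Averages hide outliers, so the histogram buckets are usually a better basis for latency targets than the average.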

Available Labels

Span metrics include a rich set of labels (dimensions) for detailed analysis. Note that in Prometheus any . (dot) in the label name is replaced by an _ (underscore); for example, the attribute service.name becomes the label service_name. The following labels are available:

| Label | Scope | Description |
| --- | --- | --- |
| service_name | General | The name of the service executing the operation. |
| service_namespace | General | The namespace in which the service is deployed. |
| k8s_cluster_name | General | The name of the Kubernetes cluster hosting the service. |
| span_kind | General | The kind of span (e.g., SERVER, CLIENT, PRODUCER, CONSUMER). |
| span_name | General | The name of the span, typically the operation or endpoint name. |
| status_code | General | The status code of the span (e.g., OK, ERROR, UNSET). |
| server_address | HTTP | The address of the server handling the request. |
| http_status_code | HTTP | The HTTP status code returned for the request. |
| http_response_status_code | HTTP | An alternative label for the HTTP response status code. |
| http_host | HTTP | The host header value from the HTTP request. |
| db_system | Database | The type of database (e.g., PostgreSQL, MySQL). |
| db_name | Database | The name of the database being accessed. |
| db_operation | Database | The database operation performed (e.g., SELECT, INSERT). |
| messaging_system | Messaging (Async) | The messaging system used (e.g., Kafka, RabbitMQ). |
| messaging_destination_name | Messaging (Async) | The destination name, such as a topic or queue. |
| messaging_operation | Messaging (Async) | The operation performed on the messaging system. |

Use these labels to filter and group metrics by service, namespace, cluster, span kind, span name, status code, HTTP status, database, or messaging system.
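
To illustrate grouping by scoped labels, the sketch below breaks a service's database calls down by system, database, and operation. The service name my-app is a placeholder, and the `db_system!=""` matcher is one assumed way to keep only series that actually carry database labels.

```promql
# The OpenTelemetry attribute db.system is queried as the label db_system.
sum by (db_system, db_name, db_operation) (
  rate(traces_spanmetrics_calls_total{service_name="my-app", db_system!=""}[5m])
)
```

The same pattern should work for the messaging labels by swapping in messaging_system, messaging_destination_name, and messaging_operation.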

Span Name

The span name uniquely identifies the traced operation or endpoint. It should be a concise and descriptive label that consistently reflects the purpose of the span. This consistency assists in aggregating and visualizing metrics, enabling effective filtering and troubleshooting.

Key recommendations for span names:

  • Use clear, descriptive names that capture the specific operation performed.
  • Maintain consistency across similar operations to facilitate aggregation.
  • Avoid generic names to ensure clarity when filtering or grouping data.
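
Consistent span names pay off when ranking operations. As a rough sketch (my-app is a placeholder service name), this query lists the ten busiest operations of a service by request rate:

```promql
# Top 10 span names by requests per second over the last 5 minutes.
topk(10,
  sum by (span_name) (
    rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
  )
)
```

If the result is dominated by a single generic name, or contains many near-identical names that differ only by an ID, the span naming probably needs attention.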

Span Kind

Metrics are collected only for specific span kinds as defined by the OpenTelemetry specification. These include:

  • SERVER: For server-side handling of requests (e.g., an HTTP server receiving a request).
  • CLIENT: For client-side operations (e.g., outbound HTTP or database requests).
  • PRODUCER: For sending messages to messaging systems (e.g., publishing to a Kafka topic).
  • CONSUMER: For receiving messages from messaging systems (e.g., reading from a Kafka topic).

Spans with a resource attribute resource.service.name equal to nais-ingress are excluded from metrics to ensure that only application-relevant data is recorded.
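
A quick way to see which span kinds a service emits is to group the call counter by span_kind. This is a sketch with my-app as a placeholder; the exact label values depend on the metrics generator (Tempo commonly emits values of the form SPAN_KIND_SERVER rather than SERVER, which is an assumption to verify in your data), so the second query uses a suffix regex that matches either form.

```promql
# Requests per second, split by span kind.
sum by (span_kind) (
  rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])
)

# Inbound (server-side) requests only; the regex matches SERVER or SPAN_KIND_SERVER.
sum(rate(traces_spanmetrics_calls_total{service_name="my-app", span_kind=~".*SERVER"}[5m]))
```

Run each query separately.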

Span Status

Span status indicates the outcome of an operation, providing context for troubleshooting. The common span status codes are:

  • OK: The span completed successfully.
  • ERROR: An error occurred during the span's execution.
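  • UNSET: The default; no explicit status was set. Instrumentation typically leaves spans that completed without error as UNSET, so this value does not indicate a failure.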

Monitoring span status alongside other metrics helps you quickly spot failed operations and unexpected behavior.
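
As a sketch of status-based monitoring, the query below computes the fraction of a service's spans that ended with an error status. The service name my-app is a placeholder, and the suffix regex is an assumption meant to match either ERROR or STATUS_CODE_ERROR, since the exact label value depends on the metrics generator.

```promql
# Share of spans with an error status (0 to 1) over the last 5 minutes.
sum(rate(traces_spanmetrics_calls_total{service_name="my-app", status_code=~".*ERROR"}[5m]))
/
sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
```

When no error series exist in the window, the numerator is empty and the query returns no data rather than 0.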

Example PromQL Queries

Below are some example PromQL queries for using span metrics in Prometheus or Grafana:

  • Requests per second for a specific service:
sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m]))
  • 99th percentile latency for successful (HTTP 200) requests to a specific host:
histogram_quantile(0.99, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app", http_host="api.example.com", http_status_code="200"}[5m])) by (le))
  • Error rate for a service, as HTTP 5xx responses per second:
sum(rate(traces_spanmetrics_calls_total{service_name="my-app", http_status_code=~"5.."}[5m]))
  • 95th percentile latency of database operations (e.g., PostgreSQL SELECT queries):
histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{db_system="postgresql", db_operation="SELECT"}[5m])) by (le))
  • Request count by HTTP host (for multi-tenant apps):
sum(rate(traces_spanmetrics_calls_total{service_name="my-app"}[5m])) by (http_host)
  • 90th percentile latency per cluster, for comparing environments:
histogram_quantile(0.90, sum(rate(traces_spanmetrics_latency_bucket{service_name="my-app"}[5m])) by (le, k8s_cluster_name))

Note: Adjust the label filters to match your application's configuration. Filter only on labels that exist for the span type in question; for example, db_system is set on database spans but generally not on plain HTTP spans, so combining it with http_host is likely to match nothing.

These metrics and queries provide a comprehensive overview of your application's distributed traces, aiding in observability and troubleshooting.