
OpenTelemetry

Use this page to configure OpenTelemetry (OTEL) trace and metric export from OngoingAI Gateway to an OTLP-compatible collector, or to expose a native Prometheus scrape endpoint.

When to use this integration

  • You run an observability stack that accepts OTLP data, such as Jaeger, Grafana Tempo, Datadog, Honeycomb, or a standalone OpenTelemetry Collector.
  • You need distributed tracing across the gateway and upstream provider calls.
  • You need gateway-level operational metrics for queue health, provider latency, and write reliability.
  • You want tenant-scoped span attributes for multi-tenant observability filtering.
  • You run Prometheus and want to scrape gateway metrics directly.
  • You want end-to-end correlation across all three observability pillars: jump from a log line to its trace, or from a latency spike on a dashboard to the exact request trace that caused it.

How it works

When enabled, the gateway initializes its telemetry pipeline at startup:

  1. A trace exporter that batches and sends spans to your collector.
  2. A metric exporter that periodically pushes counters, histograms, and gauges.
  3. A Prometheus scrape endpoint (/metrics by default) that serves metrics in Prometheus exposition format when prometheus_enabled is true.
  4. A credential scrubbing exporter that wraps the trace exporter and sanitizes all span attribute values before they leave the process, providing defense-in-depth against credential leaks in telemetry.
  5. Go runtime metric registration (go_memory_*, go_goroutine_*, go_gc_*, go_sched_*) for process-level health monitoring.

The gateway wraps its HTTP server and upstream transport with OpenTelemetry instrumentation. Each inbound request produces a server span, and each upstream proxy call produces a client span as a child of the server span. Dedicated child spans cover auth evaluation, provider routing, trace enqueue, and storage writes.

All three observability pillars are correlated end-to-end:

  • Logs → Traces: Structured JSON logs automatically include trace_id and span_id from the active request span, so any log line can be joined to its distributed trace.
  • Metrics → Traces: Histogram exemplars on proxy and provider latency metrics carry the trace_id of the recorded request. In Grafana or any exemplar-aware dashboard, clicking a latency spike jumps directly to the trace that caused it.

Trace context propagates using the W3C Trace Context standard.
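
An inbound request that already carries a W3C traceparent header joins the caller's trace rather than starting a new one. The header encodes a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte (01 = sampled); the values below are illustrative:

traceparent: 00-a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4-1a2b3c4d5e6f7a8b-01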

Configuration

YAML configuration

Add an observability.otel section to ongoingai.yaml:

YAML
observability:
  otel:
    enabled: true
    endpoint: localhost:4318
    insecure: true
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: true
    prometheus_enabled: false
    prometheus_path: /metrics
    sampling_ratio: 1.0
    export_timeout_ms: 3000
    metric_export_interval_ms: 10000

Field reference

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| observability.otel.enabled | bool | false | Master toggle. Set to true to activate OTEL export. |
| observability.otel.endpoint | string | localhost:4318 | OTLP HTTP collector endpoint. Accepts host:port or a full URL. |
| observability.otel.insecure | bool | true | Use plain HTTP. Set to false for HTTPS. |
| observability.otel.service_name | string | ongoingai-gateway | Value for the service.name resource attribute. |
| observability.otel.traces_enabled | bool | true | Enable trace span export. |
| observability.otel.metrics_enabled | bool | true | Enable OTLP push metric export. |
| observability.otel.prometheus_enabled | bool | false | Enable native Prometheus scrape endpoint. |
| observability.otel.prometheus_path | string | /metrics | Path for the Prometheus scrape endpoint. Must start with / and must not overlap with /api, /openai, or /anthropic. |
| observability.otel.sampling_ratio | float | 1.0 | Trace sampling ratio from 0.0 (none) to 1.0 (all). Uses parent-based sampling with trace ID ratio. |
| observability.otel.export_timeout_ms | int | 3000 | Timeout in milliseconds for each export request to the collector. |
| observability.otel.metric_export_interval_ms | int | 10000 | Interval in milliseconds between periodic metric exports. |

Environment variables

The gateway also accepts standard OpenTelemetry environment variables. These follow the same precedence as other env overrides: they are applied after the YAML file is loaded and take precedence over YAML values.

| Variable | Effect |
| --- | --- |
| OTEL_SDK_DISABLED | Set to true to disable OTEL entirely. |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint. If set, OTEL is auto-enabled. |
| OTEL_EXPORTER_OTLP_INSECURE | Set to true for plain HTTP transport. |
| OTEL_SERVICE_NAME | Override service_name. |
| OTEL_TRACES_EXPORTER | Set to otlp to enable traces, or none to disable. |
| OTEL_METRICS_EXPORTER | Set to otlp to enable push metrics, prometheus to enable Prometheus scrape mode, or none to disable. |
| OTEL_TRACES_SAMPLER_ARG | Sampling ratio as a float (for example, 0.5). |
| OTEL_EXPORTER_OTLP_TIMEOUT | Export timeout in milliseconds. |
| OTEL_METRIC_EXPORT_INTERVAL | Metric export interval in milliseconds. |
| ONGOINGAI_PROMETHEUS_ENABLED | Set to true to enable the Prometheus scrape endpoint. |
| ONGOINGAI_PROMETHEUS_PATH | Override the Prometheus endpoint path (default /metrics). |

Setting OTEL_EXPORTER_OTLP_ENDPOINT to a non-empty value automatically enables OTEL export, even if observability.otel.enabled is false in YAML.

Setting OTEL_METRICS_EXPORTER=prometheus enables Prometheus mode and disables OTLP push metrics. This is equivalent to setting prometheus_enabled: true and metrics_enabled: false.

Endpoint format

The endpoint field accepts two formats:

  • Host and port: localhost:4318 or collector.internal:4318. The insecure field controls whether the gateway uses HTTP or HTTPS.
  • Full URL: http://collector.internal:4318 or https://collector.example.com:4318. The URL scheme overrides the insecure setting. An http:// scheme forces insecure mode, and an https:// scheme forces secure mode.
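
For example, a full-URL endpoint pointing at a TLS collector (hostname illustrative):

YAML
observability:
  otel:
    enabled: true
    endpoint: https://collector.example.com:4318

The https:// scheme forces secure transport, so any insecure: true setting is ignored.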

Traces

Inbound request spans

The gateway creates a server span for each incoming HTTP request. The span name uses the route pattern:

| Request path | Span name |
| --- | --- |
| /openai/... | POST /openai/* |
| /anthropic/... | GET /anthropic/* |
| /api/... | POST /api/* |
| Other paths | POST /other |

The HTTP method in the span name matches the actual request method.

Upstream proxy spans

Each request forwarded to an upstream provider creates a child client span. The span name is prefixed with proxy:

| Request path | Span name |
| --- | --- |
| /openai/... | proxy POST /openai/* |
| /anthropic/... | proxy POST /anthropic/* |

Auth evaluation spans

The gateway.auth span wraps the auth middleware and records whether the request was allowed or denied.

| Attribute | Description |
| --- | --- |
| gateway.auth.result | allow or deny. |
| gateway.auth.deny_reason | unauthorized or forbidden (only set on deny). |

On deny, the span status is set to Error with the deny reason.

Route spans

The gateway.route span wraps provider routing and records the matched provider and route prefix.

| Attribute | Description |
| --- | --- |
| gateway.route.provider | Matched provider: openai, anthropic, or unknown. |
| gateway.route.prefix | Matched route prefix: /openai, /anthropic, or /. |
| gateway.org_id | Organization ID from the authenticated gateway key. |
| gateway.workspace_id | Workspace ID from the authenticated gateway key. |

The span status is set to Error for HTTP 5xx responses.

Trace enqueue spans

The gateway.trace.enqueue span records whether a trace was accepted into the async write queue or dropped due to backpressure.

| Attribute | Description |
| --- | --- |
| gateway.trace.enqueue.result | accepted or dropped. |
| gateway.org_id | Organization ID from the authenticated gateway key. |
| gateway.workspace_id | Workspace ID from the authenticated gateway key. |

On drop, the span status is set to Error with message trace dropped.

Trace write spans

The gateway.trace.write span records each storage write batch from the async trace writer.

| Attribute | Description |
| --- | --- |
| gateway.trace.write.batch_size | Number of traces in the write batch. |
| gateway.trace.write.error_class | Error classification on failure (credential-scrubbed). |

On write failure, the span status is set to Error with message write failed. The error_class value is sanitized by the credential scrubbing layer to prevent credential leakage in error messages.

Gateway attributes

After authentication completes, the span enrichment middleware adds tenant identity attributes to the active server span:

| Attribute | Description |
| --- | --- |
| gateway.correlation_id | Correlation ID linking logs, spans, and traces. |
| gateway.org_id | Organization ID from the authenticated gateway key. |
| gateway.workspace_id | Workspace ID from the authenticated gateway key. |
| gateway.key_id | Gateway API key identifier. |
| gateway.role | Role assigned to the gateway key. |

These attributes are only added when auth.enabled=true and the request authenticates successfully.
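
A minimal sketch of the auth toggle alongside OTEL export (gateway key setup is not shown here):

YAML
auth:
  enabled: true
observability:
  otel:
    enabled: true
    endpoint: localhost:4318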

Error handling

The gateway sets span status to Error for HTTP 5xx responses from upstream providers. The status message includes the HTTP status code, such as http 502. HTTP 4xx responses do not set error status on the span.

Resource attributes

All spans and metrics include the following resource attributes:

| Attribute | Value |
| --- | --- |
| service.name | Value of observability.otel.service_name. |
| service.version | Gateway binary version. |

Observability correlation

The gateway connects all three observability pillars so you can move between logs, traces, and metrics without manual ID lookups.

Logs to traces

Every structured JSON log line emitted during an active request span includes trace_id and span_id fields:

JSON
{
  "time": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "msg": "captured exchange",
  "trace_id": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  "span_id": "1a2b3c4d5e6f7a8b",
  "correlation_id": "corr-abc-123",
  "path": "/openai/v1/chat/completions",
  "status": 200
}

This is implemented via the TraceLogHandler slog wrapper, which injects trace context from the active span into every log record. Use these fields to join log lines to their distributed trace in Grafana Loki, Elasticsearch, Datadog, or any log backend that supports trace correlation.

Metrics to traces (exemplars)

The proxy and provider latency histograms (ongoingai.proxy.request_duration_seconds and ongoingai.provider.request_duration_seconds) attach exemplars with the trace_id and span_id of the request that produced each measurement.

In practice, this means:

  • A p99 latency spike on a Grafana dashboard has a clickable exemplar dot that opens the exact trace responsible.
  • Prometheus stores exemplars alongside histogram buckets when --enable-feature=exemplar-storage is active.
  • Grafana Tempo, Jaeger, and other trace backends serve as the destination of the exemplar link.

Exemplars are enabled automatically. They fire whenever the request context carries a sampled span, so sampling controls exemplar volume with no additional configuration.

Correlation summary

| From | To | Mechanism |
| --- | --- | --- |
| Log line | Trace | trace_id and span_id in structured JSON logs |
| Metric data point | Trace | Histogram exemplar with trace_id and span_id |
| Trace span | Logs | Filter logs by trace_id in your log backend |
| Trace span | Metrics | Span attributes match metric label dimensions |
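
In Grafana Loki, for example, the trace-to-logs direction is a LogQL filter on the parsed trace_id field (the job label here is illustrative):

LogQL
{job="ongoingai-gateway"} | json | trace_id = "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4"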

Metrics

The gateway exports 12 metric instruments organized in three groups.

Trace pipeline metrics

ongoingai.trace.queue_dropped_total

Type: Int64Counter

Counts trace records dropped because the async trace queue was full.

| Attribute | Description |
| --- | --- |
| provider | Provider name (openai, anthropic). |
| model | Model name from the request (unknown when unavailable). |
| org_id | Organization ID. |
| workspace_id | Workspace ID. |
| route | Route pattern (/openai/*, /anthropic/*, /api/*, /other). |
| status_code | HTTP response status code. |

ongoingai.trace.write_failed_total

Type: Int64Counter

Counts trace records dropped after a storage write failure.

| Attribute | Description |
| --- | --- |
| operation | Write operation that failed (write_trace, write_batch_fallback). |
| error_class | Classified failure: connection, timeout, contention, constraint, or unknown. |
| store | Trace storage backend (sqlite, postgres). |

ongoingai.trace.enqueued_total

Type: Int64Counter

Counts traces successfully enqueued to the async write queue. No attributes.

ongoingai.trace.written_total

Type: Int64Counter

Counts traces successfully persisted to storage. No attributes.

ongoingai.trace.flush_duration_seconds

Type: Float64Histogram Unit: seconds

Time to flush a batch of traces to storage. Uses custom bucket boundaries optimized for fast database writes: 1ms, 2.5ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s. No attributes.

ongoingai.trace.flush_batch_size

Type: Int64Histogram

Number of traces per flush batch. No attributes.

ongoingai.trace.queue_depth

Type: Int64ObservableGauge

Current number of traces waiting in the async write queue. Sampled each collection cycle. No attributes.

ongoingai.trace.queue_capacity

Type: Int64ObservableGauge

Capacity of the async trace write queue. Sampled each collection cycle. No attributes.

Provider metrics

ongoingai.provider.request_total

Type: Int64Counter

Counts upstream provider requests.

| Attribute | Description |
| --- | --- |
| provider | Provider name (openai, anthropic). |
| model | Model name from the request. |
| org_id | Organization ID (unknown when unavailable). |
| workspace_id | Workspace ID (unknown when unavailable). |
| route | Route pattern (/openai/*, /anthropic/*, /api/*, /other). |
| status_code | HTTP response status code from the provider. |

ongoingai.provider.request_duration_seconds

Type: Float64Histogram Unit: seconds

Upstream provider request duration. Uses custom bucket boundaries optimized for AI API response times: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s. Exemplars are attached with the trace_id and span_id of the recorded request.

| Attribute | Description |
| --- | --- |
| provider | Provider name (openai, anthropic). |
| model | Model name from the request. |
| org_id | Organization ID (unknown when unavailable). |
| workspace_id | Workspace ID (unknown when unavailable). |
| route | Route pattern (/openai/*, /anthropic/*, /api/*, /other). |

Proxy metrics

ongoingai.proxy.request_total

Type: Int64Counter

Counts proxy requests with tenant scoping.

| Attribute | Description |
| --- | --- |
| provider | Provider name (openai, anthropic). |
| model | Model name from the request (unknown when unavailable). |
| org_id | Organization ID. |
| workspace_id | Workspace ID. |
| route | Route pattern (/openai/*, /anthropic/*, /api/*, /other). |
| status_code | HTTP response status code. |

ongoingai.proxy.request_duration_seconds

Type: Float64Histogram Unit: seconds

Proxy request duration with tenant scoping. Uses the same custom bucket boundaries as the provider histogram (5ms to 10s). Exemplars are attached with the trace_id and span_id of the recorded request.

| Attribute | Description |
| --- | --- |
| provider | Provider name (openai, anthropic). |
| model | Model name from the request (unknown when unavailable). |
| org_id | Organization ID. |
| workspace_id | Workspace ID. |
| route | Route pattern (/openai/*, /anthropic/*, /api/*, /other). |

Prometheus

The gateway can expose a native Prometheus scrape endpoint that serves all 12 metric instruments in Prometheus exposition format. This is an alternative to OTLP push metrics for teams that run Prometheus-based monitoring.

Configuration

Enable Prometheus in YAML:

YAML
observability:
  otel:
    enabled: true
    service_name: ongoingai-gateway
    traces_enabled: false
    metrics_enabled: false
    prometheus_enabled: true
    prometheus_path: /metrics

Or with environment variables:

Bash
ONGOINGAI_PROMETHEUS_ENABLED=true \
ONGOINGAI_PROMETHEUS_PATH=/metrics \
ongoingai serve --config ongoingai.yaml

Setting OTEL_METRICS_EXPORTER=prometheus also enables Prometheus mode and disables OTLP push metrics.

Verify the endpoint

After starting the gateway with Prometheus enabled:

Bash
curl http://localhost:8080/metrics

You should see Prometheus exposition format output with ongoingai_ prefixed metric names.

Grafana / Prometheus scrape config

Add the gateway as a scrape target in your Prometheus configuration:

YAML
scrape_configs:
  - job_name: ongoingai-gateway
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics

Using both Prometheus and OTLP push

You can enable both Prometheus scrape and OTLP push metrics simultaneously:

YAML
observability:
  otel:
    enabled: true
    endpoint: localhost:4318
    insecure: true
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: true
    prometheus_enabled: true

Both exporters read from the same meter provider, so metric values are consistent across both surfaces.

Go runtime metrics

The gateway automatically registers Go runtime metrics when metrics export is enabled (OTLP push or Prometheus). These appear alongside gateway metrics and include:

  • go_memory_classes_heap_objects_bytes — heap memory in use
  • go_goroutine_count — active goroutine count
  • go_gc_duration_seconds — GC pause durations
  • go_sched_goroutines_goroutines — scheduler goroutine count

These are useful for monitoring gateway process health and capacity planning.
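
Because they share the export pipeline, runtime metrics can drive alerts like any gateway metric. A sketch of a goroutine-leak alert; the threshold is illustrative and should be baselined against your own workload:

Promql
go_goroutine_count > 5000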

Enabling exemplar storage in Prometheus

To use histogram exemplars for metrics-to-traces correlation, enable exemplar storage in Prometheus:

YAML
# Exemplar storage is a feature flag, not a config option:
#   prometheus --enable-feature=exemplar-storage --config.file=prometheus.yml
# prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: ongoingai-gateway
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics

In Grafana, exemplars appear as dots on histogram panels. Clicking an exemplar dot opens the linked trace in your configured trace data source (Tempo, Jaeger, etc.).
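
For the exemplar link to resolve, the Prometheus data source in Grafana needs an exemplar trace-ID destination. A provisioning sketch, assuming a Tempo data source with UID tempo:

YAML
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo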

Credential scrubbing

The gateway applies defense-in-depth credential scrubbing to all telemetry exports.

How it works

A scrubbing exporter wraps the OTLP trace exporter and sanitizes all string attribute values before they leave the process. The scrubbing runs in the async batch export goroutine, not on the request hot path.

The MakeWriteSpanHook also sanitizes error messages recorded in gateway.trace.write spans via ScrubCredentials.

Patterns detected

| Pattern | Examples |
| --- | --- |
| Token prefixes | sk_..., pk_..., rk_..., xoxb_..., ghp_..., pat_... |
| JWTs | eyJ... (three dot-separated base64url segments) |
| Bearer tokens | Bearer <token> in header-like strings |
| Connection string secrets | password=..., secret=..., token=... |

All detected patterns are replaced with [CREDENTIAL_REDACTED].
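
A hypothetical before/after for a scrubbed attribute value (the error text is invented for illustration):

Before: write failed: connection: password=hunter2 rejected by server
After:  write failed: connection: [CREDENTIAL_REDACTED] rejected by server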

Safety guarantees

  • Metric label values are trimmed and credential-scrubbed before export. Any detected credential pattern is replaced with [CREDENTIAL_REDACTED].
  • Missing request-scope label values are emitted as unknown to preserve a stable metric schema.
  • Span attributes that could carry credential data (error messages, status descriptions) are scrubbed before export.
  • Clean spans with no credential patterns pass through with zero allocation overhead.

Alerting recommendations

These PromQL examples target common gateway failure modes. Adjust thresholds for your traffic volume and SLO targets.

Trace queue drops

Alert when traces are being dropped due to queue backpressure:

Promql
increase(ongoingai_trace_queue_dropped_total[5m]) > 0

Any nonzero value indicates trace data loss. Investigate storage throughput and connectivity.

Trace write failures

Alert when storage writes are failing:

Promql
increase(ongoingai_trace_write_failed_total[5m]) > 0

Check error_class label for failure classification (connection, timeout, contention, constraint).

Queue saturation

Alert when the trace queue is near capacity:

Promql
ongoingai_trace_queue_depth / ongoingai_trace_queue_capacity > 0.9

Sustained high saturation precedes queue drops. Scale storage throughput or reduce capture load.

Provider error rate

Alert on elevated provider error rates:

Promql
sum(rate(ongoingai_provider_request_total{status_code=~"5.."}[5m]))
  /
sum(rate(ongoingai_provider_request_total[5m])) > 0.05

A 5% error rate threshold is a reasonable starting point. Break down by provider and model labels to isolate the source.
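
To isolate the source, the same ratio can be grouped by the provider label (add model to the by clause for a finer breakdown):

Promql
sum by (provider) (rate(ongoingai_provider_request_total{status_code=~"5.."}[5m]))
  /
sum by (provider) (rate(ongoingai_provider_request_total[5m]))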

High proxy latency

Alert on elevated proxy latency:

Promql
histogram_quantile(0.99,
  rate(ongoingai_proxy_request_duration_seconds_bucket[5m])
) > 10

Adjust the quantile and threshold to match your latency SLO.

Shutdown behavior

On SIGINT or SIGTERM, the gateway flushes pending telemetry data before exiting:

  1. The HTTP server stops accepting new connections and completes in-flight requests (5-second timeout).
  2. The trace writer drains its queue and flushes remaining trace records to storage (5-second timeout).
  3. The OpenTelemetry trace provider flushes buffered spans to the collector (5-second timeout).
  4. The OpenTelemetry metric provider flushes buffered metrics to the collector (5-second timeout).

If any flush step exceeds its timeout, the gateway logs an error and continues with the remaining shutdown steps.

Example configurations

Local development with Jaeger

Start Jaeger with OTLP ingestion:

Bash
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Configure the gateway to export to Jaeger:

YAML
observability:
  otel:
    enabled: true
    endpoint: localhost:4318
    insecure: true
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: false
    sampling_ratio: 1.0

After sending traffic through the gateway, open http://localhost:16686 to view traces in the Jaeger UI.

Production with an OTLP collector

Configure the gateway to export to a remote collector over HTTPS:

YAML
observability:
  otel:
    enabled: true
    endpoint: https://otel-collector.internal:4318
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: true
    sampling_ratio: 0.1
    export_timeout_ms: 5000
    metric_export_interval_ms: 30000

In production, consider reducing sampling_ratio to control trace volume. A ratio of 0.1 samples 10% of requests. Parent-based sampling ensures that if an incoming request already carries a sampled trace context, the gateway respects that decision regardless of the local ratio.

Prometheus-only mode

Export metrics via Prometheus without an OTLP collector:

YAML
observability:
  otel:
    enabled: true
    service_name: ongoingai-gateway
    traces_enabled: false
    metrics_enabled: false
    prometheus_enabled: true
    prometheus_path: /metrics

Environment variable quickstart

Enable OTEL export without modifying YAML:

Bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_SERVICE_NAME=ongoingai-gateway \
ongoingai serve --config ongoingai.yaml

Enable Prometheus scrape via env vars:

Bash
OTEL_METRICS_EXPORTER=prometheus \
ongoingai serve --config ongoingai.yaml

Validation checklist

  1. Verify that your OTLP collector is reachable from the gateway host:

    Bash
    curl -s -o /dev/null -w "%{http_code}" http://localhost:4318/v1/traces

    A 405 or 200 response confirms the collector is listening.

  2. Start the gateway with OTEL enabled:

    Bash
    ongoingai serve --config ongoingai.yaml
  3. Send a proxied request through the gateway:

    Bash
    curl http://localhost:8080/openai/v1/chat/completions \
      -H "Authorization: Bearer OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]}'

    Placeholder:

    • OPENAI_API_KEY: Your OpenAI API key.
  4. Check your collector or tracing UI for spans with service name ongoingai-gateway.

You should see at least two spans: one for the inbound request and one for the upstream proxy call. With auth enabled, you will also see gateway.auth and gateway.route child spans.

Troubleshooting

Spans do not appear in the collector

  • Symptom: The tracing UI shows no data for ongoingai-gateway.
  • Cause: OTEL is not enabled, or the collector endpoint is unreachable.
  • Fix: Verify that observability.otel.enabled is true and that the endpoint value is reachable from the gateway. Check gateway logs for export timeout errors.

Gateway attributes are missing from spans

  • Symptom: Spans appear but lack gateway.org_id, gateway.workspace_id, and other tenant attributes.
  • Cause: Gateway auth is not enabled, or the request did not include a valid gateway key.
  • Fix: Set auth.enabled=true and include a valid gateway key in the request header.

Sampling drops more traces than expected

  • Symptom: Only a fraction of requests produce spans.
  • Cause: sampling_ratio is set below 1.0.
  • Fix: Increase sampling_ratio toward 1.0 for higher coverage. A value of 1.0 samples all requests.

Export timeout errors in gateway logs

  • Symptom: Gateway logs contain export timeout errors on shutdown or during operation.
  • Cause: The collector is slow to respond, or export_timeout_ms is too low for your network.
  • Fix: Increase export_timeout_ms or verify collector performance.

Metrics are not exported

  • Symptom: Traces appear in the collector but metrics do not.
  • Cause: metrics_enabled is false, or the collector does not accept OTLP metrics on the configured endpoint.
  • Fix: Set metrics_enabled to true and verify that the collector supports OTLP metric ingestion on the same endpoint. Alternatively, enable prometheus_enabled to scrape metrics directly.

Prometheus /metrics returns 404

  • Symptom: curl http://localhost:8080/metrics returns 404.
  • Cause: prometheus_enabled is not set to true, or prometheus_path does not match the request path.
  • Fix: Set observability.otel.prometheus_enabled: true in YAML or ONGOINGAI_PROMETHEUS_ENABLED=true as an env var. Verify that prometheus_path matches the path you are requesting.

Credential patterns appear in spans

  • Symptom: Span attributes contain API keys or tokens.
  • Cause: This should not happen when the scrubbing exporter is active. The scrubbing exporter is automatically enabled when traces are enabled.
  • Fix: Verify that traces_enabled: true is set. If you see credential material in spans despite this, file a bug report.

Config validation fails with OTEL settings

  • Symptom: ongoingai config validate rejects the OTEL configuration.
  • Cause: A required field is empty or a numeric value is out of range.
  • Fix: Verify that endpoint and service_name are non-empty, that sampling_ratio is between 0.0 and 1.0, and that timeout values are positive integers. When prometheus_enabled is true, verify that prometheus_path starts with / and does not overlap with /api, /openai, or /anthropic.
