OpenTelemetry
Use this page to configure OpenTelemetry (OTEL) trace and metric export from OngoingAI Gateway to an OTLP-compatible collector.
When to use this integration
- You run an observability stack that accepts OTLP data, such as Jaeger, Grafana Tempo, Datadog, Honeycomb, or a standalone OpenTelemetry Collector.
- You need distributed tracing across the gateway and upstream provider calls.
- You need gateway-level operational metrics for queue health and write reliability.
- You want tenant-scoped span attributes for multi-tenant observability filtering.
How it works
When enabled, the gateway creates two OTLP HTTP exporters at startup:
- A trace exporter that batches and sends spans to your collector.
- A metric exporter that periodically pushes counters and gauges.
The gateway wraps its HTTP server and upstream transport with OpenTelemetry instrumentation. Each inbound request produces a server span, and each upstream proxy call produces a client span as a child of the server span. After authentication completes, a span enrichment middleware adds tenant identity attributes to the active span.
Trace context propagates using the W3C Trace Context standard.
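For example, a client that already participates in a trace can pass its context to the gateway in a `traceparent` header, and the gateway spans join that trace. A minimal sketch, assuming the gateway listens on `localhost:8080` and proxies OpenAI traffic under `/openai` (the trace and span IDs are placeholders):

```bash
# Join an existing trace by sending a W3C Trace Context header
# (format: version-traceid-parentid-flags). The IDs below are placeholders.
curl http://localhost:8080/openai/v1/chat/completions \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  -H "Authorization: Bearer OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]}'
```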
Configuration
YAML configuration
Add an `observability.otel` section to `ongoingai.yaml`:

```yaml
observability:
  otel:
    enabled: true
    endpoint: localhost:4318
    insecure: true
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: true
    sampling_ratio: 1.0
    export_timeout_ms: 3000
    metric_export_interval_ms: 10000
```

Field reference
| Field | Type | Default | Notes |
|---|---|---|---|
| `observability.otel.enabled` | bool | `false` | Master toggle. Set to `true` to activate OTEL export. |
| `observability.otel.endpoint` | string | `localhost:4318` | OTLP HTTP collector endpoint. Accepts `host:port` or a full URL. |
| `observability.otel.insecure` | bool | `true` | Use plain HTTP. Set to `false` for HTTPS. |
| `observability.otel.service_name` | string | `ongoingai-gateway` | Value for the `service.name` resource attribute. |
| `observability.otel.traces_enabled` | bool | `true` | Enable trace span export. |
| `observability.otel.metrics_enabled` | bool | `true` | Enable metric export. |
| `observability.otel.sampling_ratio` | float | `1.0` | Trace sampling ratio from `0.0` (none) to `1.0` (all). Uses parent-based sampling with a trace ID ratio. |
| `observability.otel.export_timeout_ms` | int | `3000` | Timeout in milliseconds for each export request to the collector. |
| `observability.otel.metric_export_interval_ms` | int | `10000` | Interval in milliseconds between periodic metric exports. |
Environment variables
The gateway also accepts standard OpenTelemetry environment variables. These follow the same precedence as other environment overrides: they are applied after the YAML values and take precedence over them.
| Variable | Effect |
|---|---|
| `OTEL_SDK_DISABLED` | Set to `true` to disable OTEL entirely. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint. If set, OTEL is auto-enabled. |
| `OTEL_EXPORTER_OTLP_INSECURE` | Set to `true` for plain HTTP transport. |
| `OTEL_SERVICE_NAME` | Override `service_name`. |
| `OTEL_TRACES_EXPORTER` | Set to `otlp` to enable traces, or `none` to disable. |
| `OTEL_METRICS_EXPORTER` | Set to `otlp` to enable metrics, or `none` to disable. |
| `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio as a float (for example, `0.5`). |
| `OTEL_EXPORTER_OTLP_TIMEOUT` | Export timeout in milliseconds. |
| `OTEL_METRIC_EXPORT_INTERVAL` | Metric export interval in milliseconds. |
Setting `OTEL_EXPORTER_OTLP_ENDPOINT` to a non-empty value automatically enables OTEL export, even if `observability.otel.enabled` is `false` in YAML.
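As a sketch of how these overrides compose, the run below keeps the YAML file but redirects export to a different collector and lowers the sampling ratio for this process only (the collector hostname and ratio are example values):

```bash
# Example values only: the env vars are applied after the YAML settings
# for this run, per the precedence note above.
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.staging:4318 \
OTEL_TRACES_SAMPLER_ARG=0.25 \
ongoingai serve --config ongoingai.yaml
```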
Endpoint format
The `endpoint` field accepts two formats:
- Host and port: `localhost:4318` or `collector.internal:4318`. The `insecure` field controls whether the gateway uses HTTP or HTTPS.
- Full URL: `http://collector.internal:4318` or `https://collector.example.com:4318`. The URL scheme overrides the `insecure` setting. An `http://` scheme forces insecure mode, and an `https://` scheme forces secure mode.
Traces
Inbound request spans
The gateway creates a server span for each incoming HTTP request. The span name uses the route pattern:
| Request path | Span name |
|---|---|
| `/openai/...` | `POST /openai/*` |
| `/anthropic/...` | `GET /anthropic/*` |
| `/api/...` | `POST /api/*` |
| Other paths | `POST /other` |
The HTTP method in the span name matches the actual request method.
Upstream proxy spans
Each request forwarded to an upstream provider creates a child client span. The span name is prefixed with `proxy`:
| Request path | Span name |
|---|---|
| `/openai/...` | `proxy POST /openai/*` |
| `/anthropic/...` | `proxy POST /anthropic/*` |
Gateway attributes
After authentication completes, the gateway adds tenant identity attributes to the active span:
| Attribute | Description |
|---|---|
| `gateway.org_id` | Organization ID from the authenticated gateway key. |
| `gateway.workspace_id` | Workspace ID from the authenticated gateway key. |
| `gateway.key_id` | Gateway API key identifier. |
| `gateway.role` | Role assigned to the gateway key. |
These attributes are only added when `auth.enabled=true` and the request authenticates successfully.
Error handling
The gateway sets the span status to `Error` for HTTP 5xx responses from upstream providers. The status message includes the HTTP status code, such as `http 502`. HTTP 4xx responses do not set an error status on the span.
Resource attributes
All spans and metrics include the following resource attributes:
| Attribute | Value |
|---|---|
| `service.name` | Value of `observability.otel.service_name`. |
| `service.version` | Gateway binary version. |
Metrics
The gateway exports two custom counters:
ongoingai.trace.queue_dropped_total
Counts trace records dropped because the async trace queue was full.
| Attribute | Description |
|---|---|
| `route` | Route pattern such as `/openai/*` or `/api/*`. |
| `status_code` | HTTP response status code. |
ongoingai.trace.write_failed_total
Counts trace records dropped after a storage write failure.
| Attribute | Description |
|---|---|
| `operation` | Write operation that failed, such as `write_trace` or `write_batch_fallback`. |
Both counters help you detect trace pipeline backpressure and storage issues before they affect audit completeness. A sustained increase in either counter indicates that trace data is being lost.
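If your collector forwards these metrics to Prometheus, an alert rule along the following lines can surface sustained drops. This is a sketch: it assumes the common dot-to-underscore name mapping applied by Prometheus exporters, so confirm the exact metric names your pipeline produces.

```yaml
# Sketch of Prometheus alerting rules; metric names assume the collector's
# Prometheus exporter maps "ongoingai.trace.queue_dropped_total" to
# "ongoingai_trace_queue_dropped_total" (and likewise for write failures).
groups:
  - name: ongoingai-gateway-trace-pipeline
    rules:
      - alert: OngoingAITraceRecordsDropped
        expr: rate(ongoingai_trace_queue_dropped_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is dropping trace records (async queue full)"
      - alert: OngoingAITraceWritesFailing
        expr: rate(ongoingai_trace_write_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway trace storage writes are failing"
```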
Shutdown behavior
On SIGINT or SIGTERM, the gateway flushes pending telemetry data before
exiting:
- The HTTP server stops accepting new connections and completes in-flight requests (5-second timeout).
- The trace writer drains its queue and flushes remaining trace records to storage (5-second timeout).
- The OpenTelemetry trace provider flushes buffered spans to the collector (5-second timeout).
- The OpenTelemetry metric provider flushes buffered metrics to the collector (5-second timeout).
If any flush step exceeds its timeout, the gateway logs an error and continues with the remaining shutdown steps.
Example configurations
Local development with Jaeger
Start Jaeger with OTLP ingestion:
```bash
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

Configure the gateway to export to Jaeger:
```yaml
observability:
  otel:
    enabled: true
    endpoint: localhost:4318
    insecure: true
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: false
    sampling_ratio: 1.0
```

After sending traffic through the gateway, open http://localhost:16686 to view
traces in the Jaeger UI.
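If you prefer Compose over a one-off `docker run`, an equivalent sketch (same image and ports as above) is:

```yaml
# docker-compose sketch equivalent to the docker run command above.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4318:4318"    # OTLP HTTP ingest
```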
Production with an OTLP collector
Configure the gateway to export to a remote collector over HTTPS:
```yaml
observability:
  otel:
    enabled: true
    endpoint: https://otel-collector.internal:4318
    service_name: ongoingai-gateway
    traces_enabled: true
    metrics_enabled: true
    sampling_ratio: 0.1
    export_timeout_ms: 5000
    metric_export_interval_ms: 30000
```

In production, consider reducing `sampling_ratio` to control trace volume. A ratio of `0.1` samples 10% of requests. Parent-based sampling ensures that if an incoming request already carries a sampled trace context, the gateway respects that decision regardless of the local ratio.
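On the collector side, a minimal OpenTelemetry Collector configuration that accepts this traffic might look like the sketch below. The receiver matches the gateway's OTLP HTTP export; the downstream exporter endpoint is a placeholder to replace with your own backend.

```yaml
# Minimal OpenTelemetry Collector sketch: receive OTLP/HTTP from the gateway
# and forward traces and metrics to a downstream OTLP backend (placeholder).
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlp:
    endpoint: backend.observability.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```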
Environment variable quickstart
Enable OTEL export without modifying YAML:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_SERVICE_NAME=ongoingai-gateway \
ongoingai serve --config ongoingai.yaml
```

Validation checklist
- Verify that your OTLP collector is reachable from the gateway host:

  ```bash
  curl -s -o /dev/null -w "%{http_code}" http://localhost:4318/v1/traces
  ```

  A `405` or `200` response confirms the collector is listening.
- Start the gateway with OTEL enabled:

  ```bash
  ongoingai serve --config ongoingai.yaml
  ```
- Send a proxied request through the gateway:

  ```bash
  curl http://localhost:8080/openai/v1/chat/completions \
    -H "Authorization: Bearer OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]}'
  ```

  Placeholder:
  - `OPENAI_API_KEY`: Your OpenAI API key.
- Check your collector or tracing UI for spans with service name `ongoingai-gateway`. You should see at least two spans: one for the inbound request and one for the upstream proxy call.
Troubleshooting
Spans do not appear in the collector
- Symptom: The tracing UI shows no data for `ongoingai-gateway`.
- Cause: OTEL is not enabled, or the collector endpoint is unreachable.
- Fix: Verify that `observability.otel.enabled` is `true` and that the `endpoint` value is reachable from the gateway. Check gateway logs for export timeout errors.
Gateway attributes are missing from spans
- Symptom: Spans appear but lack `gateway.org_id`, `gateway.workspace_id`, and other tenant attributes.
- Cause: Gateway auth is not enabled, or the request did not include a valid gateway key.
- Fix: Set `auth.enabled=true` and include a valid gateway key in the request header.
Sampling drops more traces than expected
- Symptom: Only a fraction of requests produce spans.
- Cause: `sampling_ratio` is set below `1.0`.
- Fix: Increase `sampling_ratio` toward `1.0` for higher coverage. A value of `1.0` samples all requests.
Export timeout errors in gateway logs
- Symptom: Gateway logs contain export timeout errors on shutdown or during operation.
- Cause: The collector is slow to respond, or `export_timeout_ms` is too low for your network.
- Fix: Increase `export_timeout_ms` or verify collector performance.
Metrics are not exported
- Symptom: Traces appear in the collector but metrics do not.
- Cause: `metrics_enabled` is `false`, or the collector does not accept OTLP metrics on the configured endpoint.
- Fix: Set `metrics_enabled` to `true` and verify that the collector supports OTLP metric ingestion on the same endpoint.
Config validation fails with OTEL settings
- Symptom: `ongoingai config validate` rejects the OTEL configuration.
- Cause: A required field is empty or a numeric value is out of range.
- Fix: Verify that `endpoint` and `service_name` are non-empty, that `sampling_ratio` is between `0.0` and `1.0`, and that timeout values are positive integers.