Rate limiting and quotas

Use this page to configure request and usage guardrails for provider traffic in OngoingAI Gateway. It covers enforcement scope, limit codes, and verification steps.

Enforcement scope

  • Enforces per-key and per-workspace request rate limits.
  • Enforces per-key and per-workspace daily token quotas.
  • Enforces per-key and per-workspace daily cost quotas.
  • Returns 429 with structured limit codes when a limit is exceeded.
  • Sets a Retry-After header on request-rate limit responses.

Operational fit

  • You need spend control for shared model usage.
  • You need burst protection against abusive request patterns.
  • You need fair-share protection across workspaces and keys.

Evaluation order

  1. Limits run on proxied provider routes (/openai/* and /anthropic/*).
  2. Limits use authenticated gateway identity (org_id, workspace_id, key ID) for scope.
  3. Daily quotas are checked first, using persisted trace analytics for the current UTC day.
  4. Request-rate limits are then checked against in-memory one-minute sliding windows.
  5. If a limit is exceeded, the middleware returns 429 with an error code and message.
  6. For request-rate limits, the response includes retry_after_seconds in the body and a Retry-After header (see the example response after this list).
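
For reference, this is roughly what a request-rate 429 looks like, assuming a JSON body with the fields described above and one of the limit codes listed in the next section; the message wording here is illustrative, not the gateway's exact text.

JSON
{
  "error": "per-key request rate limit exceeded",
  "code": "KEY_RATE_LIMIT_EXCEEDED",
  "retry_after_seconds": 21
}

The same response also sets a Retry-After header, as noted under Enforcement scope.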

Limit codes

  • KEY_RATE_LIMIT_EXCEEDED
  • WORKSPACE_RATE_LIMIT_EXCEEDED
  • KEY_DAILY_TOKENS_EXCEEDED
  • WORKSPACE_DAILY_TOKENS_EXCEEDED
  • KEY_DAILY_COST_EXCEEDED
  • WORKSPACE_DAILY_COST_EXCEEDED

Limits are enforced only when gateway auth is enabled because limiter scope depends on authenticated identity.

Starter limits config

YAML
auth:
  enabled: true
  header: X-OngoingAI-Gateway-Key
 
limits:
  per_key:
    requests_per_minute: 120
    max_tokens_per_day: 1000000
    max_cost_usd_per_day: 50
  per_workspace:
    requests_per_minute: 500
    max_tokens_per_day: 5000000
    max_cost_usd_per_day: 200

Set limit values greater than 0 to enable each threshold. A value of 0 disables that threshold.
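
For example, this per-key block (values chosen only for illustration) enforces a request-rate limit and a daily cost quota while leaving the token quota disabled:

YAML
limits:
  per_key:
    requests_per_minute: 60     # enforced: 60 requests per minute per key
    max_tokens_per_day: 0       # disabled: no daily token quota
    max_cost_usd_per_day: 10    # enforced: 10 USD per key per UTC day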

Policy patterns

  • Start with request-rate limits only, then add daily quotas after usage baselining.
  • Use stricter per-key limits with a higher per-workspace envelope.
  • Use workspace daily cost limits as a team budget guardrail.
  • Keep per-key IDs stable so key-scoped daily quotas can attribute usage correctly.

Example limit profiles

Strict per-key burst control with workspace headroom

YAML
auth:
  enabled: true
  header: X-OngoingAI-Gateway-Key
 
limits:
  per_key:
    requests_per_minute: 30
    max_tokens_per_day: 0
    max_cost_usd_per_day: 0
  per_workspace:
    requests_per_minute: 300
    max_tokens_per_day: 0
    max_cost_usd_per_day: 0

Daily budget ceiling for shared workspace usage

YAML
auth:
  enabled: true
  header: X-OngoingAI-Gateway-Key
 
limits:
  per_key:
    requests_per_minute: 120
    max_tokens_per_day: 1000000
    max_cost_usd_per_day: 25
  per_workspace:
    requests_per_minute: 600
    max_tokens_per_day: 8000000
    max_cost_usd_per_day: 200

Verification steps

  1. Configure a low request-rate threshold for test traffic.

    YAML
    auth:
      enabled: true
      header: X-OngoingAI-Gateway-Key
     
    limits:
      per_key:
        requests_per_minute: 2
        max_tokens_per_day: 0
        max_cost_usd_per_day: 0
  2. Start the gateway in Terminal A.

    Bash
    ongoingai config validate
    ongoingai serve
  3. Send three provider requests quickly from Terminal B.

    Bash
    for i in 1 2 3; do
      curl -i "http://localhost:8080/openai/v1/models" \
        -H "X-OngoingAI-Gateway-Key: GATEWAY_KEY" \
        -H "Authorization: Bearer $OPENAI_API_KEY"
    done

Placeholders:

  • GATEWAY_KEY: Gateway key token with proxy:write.
  • OPENAI_API_KEY: Upstream provider API key.

You should see:

  • The first two requests pass through to the provider.
  • A 429 response once the per-key rate threshold is exceeded.
  • A response body with error, code, and retry_after_seconds fields.
  • A Retry-After header on the rate-limited response (the curl check below shows one way to inspect it).
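
To spot-check the header and body fields listed above, a plain curl header dump works; this reuses the endpoint, port, and header names from the steps above and only standard curl flags.

Bash
# Dump only the response headers for one request. After the threshold is
# exceeded, the status line should read 429 and Retry-After should appear.
curl -s -o /dev/null -D - "http://localhost:8080/openai/v1/models" \
  -H "X-OngoingAI-Gateway-Key: GATEWAY_KEY" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  | grep -iE "^HTTP/|^retry-after"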

Troubleshooting

Limits do not trigger

  • Symptom: High request volume never returns 429.
  • Cause: auth.enabled=false, limits are zero, or traffic bypasses provider routes.
  • Fix: Enable auth, set limits above zero, and route traffic through /openai/* or /anthropic/*.

429 returns workspace limit code unexpectedly

  • Symptom: Response code is WORKSPACE_* when per-key limits look low.
  • Cause: Workspace thresholds are checked independently and may be lower than aggregate key traffic.
  • Fix: Raise workspace thresholds or redistribute key traffic across workspaces (see the sketch below).
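
As a quick sketch of how this happens (numbers are illustrative, other thresholds omitted), five active keys in one workspace can each stay under their own threshold while their combined traffic crosses the workspace threshold first:

YAML
limits:
  per_key:
    requests_per_minute: 120     # each key individually allowed up to 120 RPM
  per_workspace:
    requests_per_minute: 300     # 5 keys x 120 RPM = 600 RPM potential, so the
                                 # workspace limit trips before any per-key limit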

429 responses are inconsistent across multiple gateway instances

  • Symptom: Rate limits trigger differently per request path through load balancers.
  • Cause: Request-rate counters are in-memory per gateway process.
  • Fix: Treat current rate limiting as per-instance, or run a single gateway instance when you need strict global RPM behavior (see the sketch below).
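
If you do need multiple instances, one rough workaround is to divide the intended global rate by the instance count; this assumes reasonably even load balancing and is a sketch, not a documented feature.

YAML
# Target: roughly 120 requests per minute per key across 3 gateway instances.
# Each instance counts independently, so configure about 120 / 3 = 40 here.
limits:
  per_key:
    requests_per_minute: 40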

Daily quotas lag behind in-flight traffic

  • Symptom: Daily token or cost limits trigger after a short delay.
  • Cause: Daily quotas are based on persisted trace data, and trace writes are asynchronous.
  • Fix: Expect a short lag under burst load, and leave conservative headroom in quota thresholds (see the example below).
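
For example, if a team's real daily budget is 200 USD, setting the threshold a bit below it leaves room for traffic that lands while trace writes catch up; the 10% figure here is only a suggestion.

YAML
limits:
  per_workspace:
    max_cost_usd_per_day: 180    # ~10% headroom under a 200 USD/day budget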

Proxy responses return 503 for usage limit checks

  • Symptom: Response error is "gateway usage limit check unavailable".
  • Cause: Trace store analytics query failed during limit evaluation.
  • Fix: Restore trace storage health and verify storage connectivity.

Next steps