What engineers usually see
- Long delay between request start and first token
- User sees a loading spinner for an unpredictable duration
- No visibility into what's causing the delay
- Could be queuing, cold start, or rate limiting
Why this is hard to debug
First-token latency usually isn't logged separately from total request time, so you can't tell whether the delay came from waiting in a queue, a model cold start, or the inference itself. Without that breakdown, there's nothing concrete to optimize.
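To see why the split matters, it helps to record time to first token and total time as separate numbers. A minimal sketch, using a simulated stream (a generator with artificial delays, since the real queue/inference split can't be recovered client-side after the fact):

```python
import time

def fake_stream(pre_inference_wait, per_token, tokens):
    """Simulated streaming response: one up-front delay (queue wait,
    cold start, routing), then tokens arriving at a steady rate."""
    time.sleep(pre_inference_wait)
    for i in range(tokens):
        time.sleep(per_token)
        yield f"tok{i}"

start = time.time()
ttft = None
for chunk in fake_stream(pre_inference_wait=0.2, per_token=0.01, tokens=5):
    if ttft is None:
        ttft = time.time() - start  # time to first token
total = time.time() - start

print(f"TTFT:  {ttft:.3f}s")   # dominated by the pre-inference delay
print(f"Total: {total:.3f}s")
```

Logging only `total` hides the fact that most of it was spent before the first token ever arrived; logging both makes the pre-inference delay visible per request.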
Minimal repro
import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENAI_KEY",
    base_url="https://aibadgr.com/v1"
)

start = time.time()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "test"}],
    stream=True
)
first_chunk = next(stream)
print(f"TTFB: {time.time() - start}s")

This request routes through AI Badgr and returns a stable request ID that links to an execution record.
Note: AI Badgr is OpenAI-compatible and works as a drop-in proxy. No SDK changes required — only the base_url changes.
What a per-request execution record makes visible
- Time to first byte (TTFB)
- Time to first token specifically
- Queue wait time (if any)
- Provider routing time
- Breakdown of pre-inference latency
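As an illustration of how those fields fit together (the field names below are hypothetical, not AI Badgr's actual receipt schema), a per-request record lets you attribute latency to stages:

```python
# Hypothetical receipt fields, for illustration only -- all times in ms.
receipt = {
    "queue_wait_ms": 180,   # time spent waiting before routing
    "routing_ms": 25,       # provider selection / handoff
    "ttft_ms": 950,         # request start -> first token
    "total_ms": 2400,       # request start -> last token
}

# Pre-inference latency is everything before the model starts working.
pre_inference = receipt["queue_wait_ms"] + receipt["routing_ms"]

# What remains of TTFT is attributable to the provider (e.g. cold start).
provider_delay = receipt["ttft_ms"] - pre_inference

print(pre_inference, provider_delay)  # 205 745
```

With only a total-time log, those 205 ms of pre-inference delay would be indistinguishable from slow generation.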
Run 1 request → get receipt
Change your base URL to https://aibadgr.com/v1 and run your request.
The response includes an X-Badgr-Request-Id header that links to a receipt showing latency, retries, tokens, cost, and failure stage for that specific execution.
Not the engineer?
Share this page with your dev and ask them to run one request through AI Badgr. That's all that's needed to get the receipt.
Debugging like this only works when you can see what happened to a single request from start to finish, instead of trying to piece it together from scattered logs.