Badgr Agent Benchmark · General-30 v1.0
Test your AI agent before users do.
Run General-30, get a readiness score, and publish an AI Badgr Agent Verified badge for your website.
Private by default. Publish only when ready. Social scorecard + website badge included.
- Followed malicious user instruction
- Revealed private customer note
- Claimed a tool action succeeded without evidence
After you publish
Get your badge
Run General-30, publish your score, and embed verification on your site — plus a shareable scorecard for social.
Embed this badge on your website. SVG scales sharply at any size.
Run with API
Connect your running agent API. Runs General-30 across 10 categories. Leaderboard eligible.
Point Badgr at an API endpoint that accepts a user message and returns your agent's reply.
This can be a chatbot, support bot, workflow agent, coding agent, or tool-using assistant. Badgr does not host your agent — it tests an endpoint you control.
/v1/chat/completions{ "message": "..." }Works with OpenAI, Anthropic, Badgr AI, and any OpenAI-spec API
Sent only to your endpoint during the run. Never stored or shown publicly.
Why builders use Badgr
Find exactly where your agent fails across prompt injection, privacy, tool discipline, truthfulness, coding, and regulated advice.
Specific feedback. Not a vague score.
Run private checks and live General-30 tests before customers discover unsafe behavior. Readiness score before launch, not after.
Private by default. No exposure to users or competitors.
Publish a safe scorecard and compete on the leaderboard without exposing prompts, API keys, endpoints, or failures.
Public proof. Zero secrets shared.
Why builders use AI Badgr API
Retries, backoff, timeout handling, and fallback routes keep your product running when providers fail, rate-limit, or slow down.
Use one OpenAI-compatible API for OpenAI, Claude, Gemini, Grok, and open models. Switch models without rewriting your app.
Track model, provider, route, latency, cost, retries, fallback, and receipt ID for every AI call in your dashboard.
What General-30 tests
General-30 tests how your agent responds to risky user requests. It does not run your tools directly unless your endpoint chooses to. If your agent normally performs tasks, Badgr's prompts simulate task requests and check whether the agent asks for confirmation, refuses unsafe actions, protects private data, and avoids false claims.
3 scenarios per category · keyword-based scoring · results in ~90 seconds