Badgr Agent Benchmark · General-30 v1.0

Test your AI agent before users do.

Run General-30, get a readiness score, and publish an AI Badgr Agent Verified badge for your website.

Private by default. Publish only when ready. Social scorecard + website badge included.

Example scorecard · General-30
46/100
Prompt Injection1/2
Privacy1/2
Tool Discipline1/2
Truthfulness2/2
Coding0/2
Top failures
  1. Followed malicious user instruction
  2. Revealed private customer note
  3. Claimed a tool action succeeded without evidence

After you publish

Get your badge

Run General-30, publish your score, and embed verification on your site — plus a shareable scorecard for social.

3Show agent verification on your site

Embed this badge on your website. SVG scales sharply at any size.

AI Badgr Agent Verified·General-30 v1.0 · 68/100

Run with API

Connect your running agent API. Runs General-30 across 10 categories. Leaderboard eligible.

Point Badgr at an API endpoint that accepts a user message and returns your agent's reply.

This can be a chatbot, support bot, workflow agent, coding agent, or tool-using assistant. Badgr does not host your agent — it tests an endpoint you control.

OpenAI-compatible: any endpoint at /v1/chat/completions
Simple JSON: endpoint accepting { "message": "..." }
Badgr AI API: use your Badgr API endpoint if you're running your agent on Badgr

Works with OpenAI, Anthropic, Badgr AI, and any OpenAI-spec API

Sent only to your endpoint during the run. Never stored or shown publicly.

Why builders use Badgr

Stop training blind

Find exactly where your agent fails across prompt injection, privacy, tool discipline, truthfulness, coding, and regulated advice.

Specific feedback. Not a vague score.

Know before users do

Run private checks and live General-30 tests before customers discover unsafe behavior. Readiness score before launch, not after.

Private by default. No exposure to users or competitors.

Prove your agent is ready

Publish a safe scorecard and compete on the leaderboard without exposing prompts, API keys, endpoints, or failures.

Public proof. Zero secrets shared.

Why builders use AI Badgr API

Reliable AI requests

Retries, backoff, timeout handling, and fallback routes keep your product running when providers fail, rate-limit, or slow down.

One API, many models

Use one OpenAI-compatible API for OpenAI, Claude, Gemini, Grok, and open models. Switch models without rewriting your app.

Receipts for every request

Track model, provider, route, latency, cost, retries, fallback, and receipt ID for every AI call in your dashboard.

What General-30 tests

General-30 tests how your agent responds to risky user requests. It does not run your tools directly unless your endpoint chooses to. If your agent normally performs tasks, Badgr's prompts simulate task requests and check whether the agent asks for confirmation, refuses unsafe actions, protects private data, and avoids false claims.

3 scenarios per category · keyword-based scoring · results in ~90 seconds

Prompt Injection
Privacy
Tool Discipline
Truthfulness
Customer Support
Coding
Finance
Healthcare
Legal/HR
Cybersecurity

Leaderboard preview

View full leaderboard →
Loading leaderboard…

Frequently asked questions

What is General-30 and how does it work?
General-30 is a 30-scenario safety benchmark for AI agents. It sends your agent real user-like prompts across 10 categories — from prompt injection to healthcare — and scores each response using keyword-based rules. It does not require access to your system prompt, and your API key is never stored or logged.
How do I connect my agent?
Paste your agent's public HTTPS endpoint URL. If it uses the OpenAI chat format, select the 'OpenAI-compatible' preset. Add your Authorization header if your endpoint requires one. Hit 'Test connection' first — that sends a single hello to confirm it's reachable — then run General-30.
What format does my agent endpoint need to return?
Any JSON response works. Set the 'Response path' under Advanced settings to the field that holds the text reply — for example choices.0.message.content for OpenAI, or answer for a simple JSON object. The test connection step will confirm the path resolves correctly.
Is my prompt, API key, or endpoint stored?
No. Your API key and endpoint URL are sent to our servers only to run the benchmark and are never stored in the report. The report stores only the scenario message, agent response excerpt (up to 2000 chars), and score. Your prompt is never stored in any form.
What is the Prompt Check tab?
Prompt Check analyzes your system prompt text against 6 safety categories without calling any endpoint. It's a free, instant way to spot gaps before running the full General-30 benchmark. Results are always private and not leaderboard eligible.
How is the score calculated?
Each scenario is scored pass (1 point), partial (0.5 points), or fail (0 points) using keyword-based rules tuned per category. Overall score = total points ÷ max points × 100. Timeout or empty responses count as fails.
Why are some scenarios timing out?
If your endpoint takes longer than 60 seconds to respond, the request is retried once automatically. Persistent timeouts usually mean the underlying model is rate-limited or slow — try a faster model or reduce concurrency on your side.
Can I publish my results to the leaderboard?
Yes — after completing General-30 with a live endpoint, you can opt in to publish. Only your agent name, score, badge, and optional website are shown publicly. Your endpoint, API key, prompt, and failed responses are never shared.
What is the limit on free runs?
Each account gets 1 free Prompt Check and 1 free General-30 endpoint run per day. The two limits are tracked separately — using Prompt Check does not use your daily benchmark run.
What comes after General-30?
General-100 is coming next — 100 scenarios across all 10 categories with deeper coverage of edge cases, adversarial inputs, and multi-turn safety. It will be available for teams that need a more rigorous certification.