Eval Harness Playground

What it does

Points at any inference endpoint (yours or a hosted one), runs a curated set of small but informative evals, and produces a scorecard. The set is small on purpose: you get answers in minutes instead of hours, and the evals are chosen for signal-to-noise rather than headline numbers.

Why it’s useful

Full benchmark suites take hours and cost real money. Most of the time you just want a quick answer to “is this model worth a closer look?” This playground gives you a focused first read so you can decide whether to invest in a full eval pass.

How to use it

Provide an endpoint URL and auth header (or pick a hosted preset).
Select an eval pack: reasoning, code, instruction following, or refusal behavior.
Run. Results stream in as each prompt completes.
Export the scorecard or share a permalink.
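
The steps above can be pictured with a minimal Python sketch. It is not the playground's actual implementation: EvalCase, CaseResult, http_completer, run_pack, and scorecard are hypothetical names, and the JSON request and response shape ({"prompt": ...} in, {"text": ...} out) is an assumption you would adapt to your own endpoint. Results are yielded one at a time to mirror how they stream in as each prompt completes.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


# Hypothetical types: the playground's real pack format is not documented here.
@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True if the completion passes


@dataclass
class CaseResult:
    prompt: str
    completion: str
    passed: bool


def http_completer(url: str, auth_header: str) -> Callable[[str], str]:
    """Build a completion function for an endpoint URL and auth header.

    Assumes a JSON API that accepts {"prompt": ...} and returns {"text": ...};
    change both to match the schema your endpoint actually uses.
    """
    def complete(prompt: str) -> str:
        req = urllib.request.Request(
            url,
            data=json.dumps({"prompt": prompt}).encode("utf-8"),
            headers={
                "Authorization": auth_header,
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))["text"]

    return complete


def run_pack(complete: Callable[[str], str],
             cases: Iterable[EvalCase]) -> Iterator[CaseResult]:
    """Yield one result per case as soon as its completion is available."""
    for case in cases:
        completion = complete(case.prompt)
        yield CaseResult(case.prompt, completion, case.check(completion))


def scorecard(results: Iterable[CaseResult]) -> dict:
    """Summarize results as total, passed, and pass rate."""
    results = list(results)
    total = len(results)
    passed = sum(r.passed for r in results)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
    }


# Offline demo: a stand-in model replaces the HTTP call so nothing is sent.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure"


pack = [
    EvalCase("What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("What is the capital of France?", lambda out: "Paris" in out),
]
card = scorecard(run_pack(fake_model, pack))
print(card)
```

To point the same loop at a real endpoint, you would pass http_completer(url, auth_header) in place of fake_model. Keeping the completion function injectable is a design choice that lets the pack and scoring logic be exercised without network access.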

Limitations

Beta: the eval set is opinionated and will keep evolving.
Not a replacement for full evaluation suites when stakes are high.
Caches results by prompt hash, so re-running the same prompts is free but returns the stored results rather than fresh completions.
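
Caching by prompt hash can be sketched roughly as follows. This is an illustration under stated assumptions, not the playground's code: prompt_key and CachedCompleter are hypothetical names, SHA-256 and an in-memory dictionary are arbitrary choices, and the key here covers only the prompt text. The real cache may key on more than that (for example the endpoint), which this sketch does not model.

```python
import hashlib
from typing import Callable, Dict


def prompt_key(prompt: str) -> str:
    """Hash the prompt text; SHA-256 is an assumed choice for this sketch."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class CachedCompleter:
    """Wrap a completion function so repeated prompts reuse stored results."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the wrapped function was called

    def __call__(self, prompt: str) -> str:
        key = prompt_key(prompt)
        if key not in self._cache:
            # First time this prompt is seen: call the model and store it.
            self.misses += 1
            self._cache[key] = self._complete(prompt)
        # Later calls with the same prompt return the stored completion.
        return self._cache[key]
```

With a cache keyed only on the prompt, a second run of an identical prompt never reaches the model, which is why it costs nothing and also why it cannot reflect any change in the model's behavior since the first run.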