Running Benchmarks
This page is the runbook for the Sentry benchmark. The overview page is the readout. This page is for creating a new result.
Run one Warden shard per corpus commit. Each shard checks out the public
getsentry/sentry repository at that commit and scans only the files referenced
by corpus entries for that commit.
Do not run this benchmark against all of Sentry. The point is a repeatable comparison against known vulnerable files.
Clone Sentry and pick a runtime/model pair:
git clone git@github.com:getsentry/sentry.git /tmp/sentry-benchmark
export BENCH_MODEL="openai/gpt-5.5"
export BENCH_MODEL_SLUG="gpt-5-5"
export BENCH_RUNTIME="pi"
export BENCH_RUNTIME_SLUG="pi"
export BENCH_EFFORT="high"
export BENCH_RUN_SLUG="pi-gpt-5-5-high"
export BENCH_ROOT="/tmp/warden-sentry-benchmark-${BENCH_RUN_SLUG}"
mkdir -p "$BENCH_ROOT"
Set the provider API key required by the model. For GPT 5.5 through Pi, set
WARDEN_OPENAI_API_KEY. For Anthropic models, set WARDEN_ANTHROPIC_API_KEY.
For the Claude SDK runtime, use Claude Code model IDs such as
claude-sonnet-4-6 instead of Pi provider/model selectors.
The older GPT 5.5 run used openai/gpt-5.5. The installed Pi registry did not
expose openai/gpt-5.5-codex. That run did not pass an explicit effort flag, so
it used the runtime/provider default. New benchmark runs should pass
--effort high or --effort low when testing reasoning behavior, and omit
--effort only when intentionally measuring the runtime default.
Target Lists
From the Warden repository, write one target list per corpus commit:
node <<'NODE'
const {execFileSync} = require("node:child_process");
const {mkdirSync, readFileSync, writeFileSync} = require("node:fs");

const corpus = JSON.parse(
  readFileSync("packages/docs/src/data/benchmarking/sentry-vulnerability-corpus.json", "utf8"),
);
const repo = "/tmp/sentry-benchmark";
const outDir = process.env.BENCH_ROOT;
mkdirSync(outDir, {recursive: true});

const bySha = new Map();
for (const finding of corpus.findings) {
  const entry = bySha.get(finding.sha) ?? {findings: 0, paths: new Set()};
  entry.findings += 1;
  entry.paths.add(finding.code.path);
  bySha.set(finding.sha, entry);
}

for (const [sha, entry] of [...bySha.entries()].sort()) {
  const paths = [...entry.paths].sort();
  const missing = [];
  for (const path of paths) {
    try {
      execFileSync("git", ["-C", repo, "cat-file", "-e", `${sha}:${path}`], {stdio: "ignore"});
    } catch {
      missing.push(path);
    }
  }
  if (missing.length > 0) {
    throw new Error(`${sha} missing corpus paths:\n${missing.join("\n")}`);
  }
  writeFileSync(`${outDir}/targets-${sha}.txt`, `${paths.join("\n")}\n`);
  console.error(`${sha}: ${paths.length} target files for ${entry.findings} corpus findings`);
}
NODE
The current corpus produces 79 target files across 6 commits.
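Before launching any shards, it can help to confirm that the corpus yields the expected commit and target-file counts. The following is a minimal sketch, not part of Warden: `summarizeTargets` is a hypothetical helper, and it assumes the reported total is the sum of unique paths per commit, so a path that appears under two commits counts twice. It relies only on the `sha` and `code.path` fields that the script above already reads.

```javascript
// Hypothetical helper: summarize how many commits and per-commit target files
// a list of corpus findings produces. Assumes each finding has `sha` and
// `code.path`, as the target-list script does.
function summarizeTargets(findings) {
  const pathsBySha = new Map();
  for (const finding of findings) {
    const paths = pathsBySha.get(finding.sha) ?? new Set();
    paths.add(finding.code.path); // a Set de-duplicates paths within one commit
    pathsBySha.set(finding.sha, paths);
  }

  let targetFiles = 0;
  for (const paths of pathsBySha.values()) {
    targetFiles += paths.size;
  }
  return {commits: pathsBySha.size, targetFiles};
}

// Illustrative data only; these SHAs and paths are made up.
const sample = [
  {sha: "aaa", code: {path: "src/a.py"}},
  {sha: "aaa", code: {path: "src/a.py"}},
  {sha: "aaa", code: {path: "src/b.py"}},
  {sha: "bbb", code: {path: "src/a.py"}},
];
console.log(summarizeTargets(sample));
```

Passing `corpus.findings` from the real corpus file should reproduce the counts reported above if the assumption about how they are totaled holds.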
Warden Config
Write the config outside the Sentry checkout. Keep runtime, model, and effort explicit on the CLI so the invocation captures the run shape. The config keeps the benchmark skill, thresholds, concurrency, and verifier policy stable.
cat > "$BENCH_ROOT/warden.toml" <<EOF
version = 1

[defaults]
reportOn = "low"

[defaults.verification]
enabled = true

[runner]
concurrency = 4

[[skills]]
name = "security-review"
EOF
Run from the Warden repository:
effort_args=()
if [ -n "${BENCH_EFFORT:-}" ]; then
  effort_args=(--effort "$BENCH_EFFORT")
fi

for target in "$BENCH_ROOT"/targets-*.txt; do
  sha=${target##*/targets-}
  sha=${sha%.txt}
  short=${sha:0:8}

  git -C /tmp/sentry-benchmark checkout "$sha"

  pnpm cli -- run \
    -C /tmp/sentry-benchmark \
    @"$target" \
    --skill security-review \
    --config-path "$BENCH_ROOT/warden.toml" \
    --runtime "$BENCH_RUNTIME" \
    --model "$BENCH_MODEL" \
    "${effort_args[@]}" \
    --traces \
    --report-on low \
    --min-confidence low \
    --parallel 4 \
    -o "$BENCH_ROOT/sentry-security-review-${BENCH_RUN_SLUG}-corpus-${short}.jsonl" \
    -v \
    --log
done
Inspect the stitched output:
pnpm cli -- runs show "$BENCH_ROOT"/sentry-security-review-"$BENCH_RUN_SLUG"-corpus-*.jsonl \
  -C /tmp/sentry-benchmark \
  --report-on low \
  --min-confidence low
Record Results
Keep every raw JSONL shard with the result summary, but do not commit raw JSONL until it has been reviewed for sensitive data. The raw artifacts are the source of truth for cost, duration, token counts, and future rescoring.
Store results in packages/docs/src/data/benchmarking/results/.
Record:
- stable runId
- corpus ID
- repository
- Warden version
- skill
- model
- runtime
- effort level, or provider-default
- whether Warden’s post-analysis finding verifier was enabled
- whether --traces was enabled, plus any run-level trace IDs preserved in the JSONL metadata
- report and confidence thresholds
- one shard per corpus commit, including SHA, target list, raw JSONL artifact name, and raw artifact review status
- total files, chunks, failed chunks, findings, cost, duration, and tokens
- timing.analysisChunkMs from top-level per-record durationMs values in the raw JSONL artifacts, when all raw shard artifacts are available
- total wall duration from the stitched run summary
- scoring summary once a reviewer matches findings back to the corpus
Warden’s finding verifier is enabled by default. Benchmark runs should leave it
enabled unless they are deliberately testing verifier-off behavior. It is
disabled only when defaults.verification.enabled = false is set in
warden.toml. Record this as findingVerification.enabled in the result JSON.
Verifier calls are part of Warden’s analysis pipeline, not benchmark scoring.
They can add provider cost, and runs with more candidate findings generally do
more verifier work. Keep this separate from the benchmark scores field, which
is the later semantic match against the corpus.
The timing breakdown has the same separation. Per-chunk P50 and P90 timing is recorded before Warden’s post-analysis verifier runs. Total timing includes post-analysis work and upstream provider latency, so treat it as flaky operational context rather than a stable comparison metric.
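The per-chunk P50 and P90 values can be derived from the raw shards with a short script. The sketch below is illustrative rather than Warden's own implementation: `chunkTimingPercentiles` is a hypothetical helper, the nearest-rank percentile method and the `{count, p50, p90}` output shape are assumptions, and it treats every JSONL record that carries a numeric top-level `durationMs` as one analysis chunk.

```javascript
// Hypothetical helper: compute nearest-rank P50/P90 over the top-level
// `durationMs` values found in raw JSONL text (one JSON object per line).
function chunkTimingPercentiles(jsonlText) {
  const durations = [];
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines between records
    const record = JSON.parse(line);
    // Assumption: records without a numeric durationMs are not chunk records.
    if (typeof record.durationMs === "number") {
      durations.push(record.durationMs);
    }
  }
  durations.sort((a, b) => a - b);

  // Nearest-rank percentile: the smallest value with at least p% of samples
  // at or below it. Returns null when there are no samples.
  const percentile = (p) => {
    if (durations.length === 0) return null;
    const rank = Math.ceil((p * durations.length) / 100);
    return durations[Math.max(rank, 1) - 1];
  };

  return {count: durations.length, p50: percentile(50), p90: percentile(90)};
}

// Illustrative input only; real shards contain many more fields per record.
const sampleJsonl = [
  '{"durationMs": 300}',
  '{"durationMs": 100}',
  '{"note": "record without timing"}',
  '{"durationMs": 200}',
].join("\n");
console.log(chunkTimingPercentiles(sampleJsonl));
```

To cover a whole run, concatenate the text of every shard before calling the helper, and skip the calculation when any shard is missing, since the breakdown is only recorded when all raw shard artifacts are available.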
Score by agent-verified semantic match, not exact wording or line number. A result counts as found when it identifies the same bug in roughly the same location as an existing corpus finding.
This is not deterministic. An agent reviews every emitted finding against the existing corpus findings for that commit. Same-file findings about different bugs do not count. One emitted finding may count for multiple corpus entries when it clearly covers the same bug represented by multiple existing entries. Duplicate emitted findings do not double-count the same corpus entry.
Use this scoring checklist:
- Read every emitted finding from the raw JSONL shards for the run.
- For each finding, compare it against corpus entries from the same commit.
- Use same path and nearby line range as the first candidate filter, but make the final decision semantically.
- Count known-found only when the finding would lead a reviewer to the same bug in roughly the same code location.
- Mark same-file findings about different bugs as not-known.
- Record one scores entry for every emitted finding, including non-matches.
- Set scoring.knownFound to the number of unique matched corpus entries, not the number of emitted findings.
- Leave the run unscored when the raw findings are missing or cannot be semantically verified.
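The counting rule in the checklist can be sketched as follows. This is only an illustration under stated assumptions: the shape of a `scores` entry is not specified here, so the `status` and `corpusIds` fields, the finding and corpus IDs, and the `countKnownFound` helper are all hypothetical. Only the known-found and not-known labels come from the checklist.

```javascript
// Hypothetical helper: count unique matched corpus entries across all
// per-finding score entries. Assumes each entry has a `status` and, for
// matches, a `corpusIds` array naming the corpus entries it covers.
function countKnownFound(scores) {
  const matched = new Set();
  for (const score of scores) {
    if (score.status !== "known-found") continue; // non-matches never count
    for (const corpusId of score.corpusIds ?? []) {
      matched.add(corpusId); // duplicates collapse, so no double-counting
    }
  }
  return matched.size;
}

// Illustrative entries only: one per emitted finding, including non-matches.
const scores = [
  {findingId: "f1", status: "known-found", corpusIds: ["c1", "c2"]}, // covers two entries
  {findingId: "f2", status: "known-found", corpusIds: ["c1"]}, // duplicate of c1
  {findingId: "f3", status: "not-known", corpusIds: []}, // same file, different bug
];
console.log(countKnownFound(scores));
```

Using a set keyed by corpus entry captures both rules at once: one finding may credit several corpus entries, and repeated findings for the same entry are counted only once.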
Keep the distinction clear:
- known found: corpus vulnerabilities Warden found
- total findings: all findings Warden emitted before scoring
- unexpected valid: real vulnerabilities not already in the corpus
- false positives: findings rejected by review
Do not treat the score as a universal pass rate. It is a relative comparison for this corpus and this run shape.