Running Benchmarks

This page is the runbook for the Sentry benchmark. The overview page reports existing results; use this page to produce a new one.

Run one Warden shard per corpus commit. Each shard checks out the public getsentry/sentry repository at that commit and scans only the files referenced by corpus entries for that commit.

Do not run this benchmark against all of Sentry. The point is a repeatable comparison against known vulnerable files.

Clone Sentry and pick a runtime/model pair:

Terminal window
git clone git@github.com:getsentry/sentry.git /tmp/sentry-benchmark
export BENCH_MODEL="openai/gpt-5.5"
export BENCH_MODEL_SLUG="gpt-5-5"
export BENCH_RUNTIME="pi"
export BENCH_RUNTIME_SLUG="pi"
export BENCH_EFFORT="high"
export BENCH_RUN_SLUG="pi-gpt-5-5-high"
export BENCH_ROOT="/tmp/warden-sentry-benchmark-${BENCH_RUN_SLUG}"
mkdir -p "$BENCH_ROOT"

Set the provider API key required by the model. For GPT 5.5 through Pi, set WARDEN_OPENAI_API_KEY. For Anthropic models, set WARDEN_ANTHROPIC_API_KEY. For the Claude SDK runtime, use Claude Code model IDs such as claude-sonnet-4-6 instead of Pi provider/model selectors.
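A missing key usually surfaces only after the first shard starts, so a preflight check can save a wasted checkout. The sketch below covers only the two cases named above. The `anthropic/` selector prefix and the helper names `requiredKeyFor` and `checkProviderKey` are assumptions for illustration, not part of Warden.

```javascript
// Preflight sketch: confirm the provider key for the chosen model is set.
// The prefix-to-variable mapping is an assumption based on the two cases
// this page names; extend it for other providers.
const KEY_BY_PROVIDER = new Map([
  ["openai", "WARDEN_OPENAI_API_KEY"],
  ["anthropic", "WARDEN_ANTHROPIC_API_KEY"], // prefix is hypothetical
]);

function requiredKeyFor(model) {
  // Pi selectors look like "provider/model"; bare IDs carry no provider prefix.
  const slash = model.indexOf("/");
  if (slash === -1) return null;
  return KEY_BY_PROVIDER.get(model.slice(0, slash)) ?? null;
}

function checkProviderKey(model, env = process.env) {
  const key = requiredKeyFor(model);
  if (key === null) return {key: null, ok: false, reason: "unknown provider"};
  const ok = typeof env[key] === "string" && env[key].length > 0;
  return {key, ok, reason: ok ? "set" : "missing"};
}

// Demo with an explicit environment object instead of the real one.
console.log(checkProviderKey("openai/gpt-5.5", {WARDEN_OPENAI_API_KEY: "example"}));
console.log(checkProviderKey("openai/gpt-5.5", {}));
```

Bare Claude Code model IDs such as claude-sonnet-4-6 have no provider prefix, so this sketch reports them as unknown rather than guessing which key applies.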

The older GPT 5.5 run used openai/gpt-5.5; the installed Pi registry did not expose openai/gpt-5.5-codex. That run did not pass an explicit effort flag, so it used the runtime/provider default. New benchmark runs should pass --effort high or --effort low when testing reasoning behavior, and omit --effort only when intentionally measuring the runtime default.

From the Warden repository, write one target list per corpus commit:

Terminal window
node <<'NODE'
const {execFileSync} = require("node:child_process");
const {mkdirSync, readFileSync, writeFileSync} = require("node:fs");

const corpus = JSON.parse(
  readFileSync("packages/docs/src/data/benchmarking/sentry-vulnerability-corpus.json", "utf8"),
);
const repo = "/tmp/sentry-benchmark";
const outDir = process.env.BENCH_ROOT;
mkdirSync(outDir, {recursive: true});

// Group corpus findings by commit and collect the unique file paths for each.
const bySha = new Map();
for (const finding of corpus.findings) {
  const entry = bySha.get(finding.sha) ?? {findings: 0, paths: new Set()};
  entry.findings += 1;
  entry.paths.add(finding.code.path);
  bySha.set(finding.sha, entry);
}

for (const [sha, entry] of [...bySha.entries()].sort()) {
  const paths = [...entry.paths].sort();
  // Fail early if any corpus path does not exist at that commit.
  const missing = [];
  for (const path of paths) {
    try {
      execFileSync("git", ["-C", repo, "cat-file", "-e", `${sha}:${path}`], {stdio: "ignore"});
    } catch {
      missing.push(path);
    }
  }
  if (missing.length > 0) {
    throw new Error(`${sha} missing corpus paths:\n${missing.join("\n")}`);
  }
  // One target list per commit, one path per line.
  writeFileSync(`${outDir}/targets-${sha}.txt`, `${paths.join("\n")}\n`);
  console.error(`${sha}: ${paths.length} target files for ${entry.findings} corpus findings`);
}
NODE

The current corpus produces 79 target files across 6 commits.
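To confirm the generated lists match those totals before starting a run, the lists can be counted directly. This is a minimal sketch: `summarizeTargets` is a hypothetical helper, and it assumes the total is the sum of per-commit list lengths, so a path that appears under two commits counts twice. The demo uses a throwaway directory with made-up SHAs and paths; point the function at `$BENCH_ROOT` for a real check.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Count target files per shard from the targets-<sha>.txt lists in a directory.
function summarizeTargets(dir) {
  const shards = fs
    .readdirSync(dir)
    .filter((name) => /^targets-[0-9a-f]+\.txt$/.test(name))
    .sort()
    .map((name) => {
      const lines = fs
        .readFileSync(path.join(dir, name), "utf8")
        .split("\n")
        .filter((line) => line.length > 0);
      return {sha: name.slice("targets-".length, -".txt".length), files: lines.length};
    });
  const totalFiles = shards.reduce((sum, shard) => sum + shard.files, 0);
  return {commits: shards.length, totalFiles, shards};
}

// Demo on a temporary directory; SHAs and paths are invented for illustration.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "targets-demo-"));
fs.writeFileSync(path.join(demoDir, "targets-aaa111.txt"), "src/a.py\nsrc/b.py\n");
fs.writeFileSync(path.join(demoDir, "targets-bbb222.txt"), "src/c.py\n");
const demo = summarizeTargets(demoDir);
console.log(`${demo.totalFiles} target files across ${demo.commits} commits`);
fs.rmSync(demoDir, {recursive: true, force: true});
```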

Write the config outside the Sentry checkout. Keep runtime, model, and effort explicit on the CLI so the invocation captures the run shape. The config keeps the benchmark skill, thresholds, concurrency, and verifier policy stable.

Terminal window
cat > "$BENCH_ROOT/warden.toml" <<EOF
version = 1
[defaults]
reportOn = "low"
[defaults.verification]
enabled = true
[runner]
concurrency = 4
[[skills]]
name = "security-review"
EOF

Run from the Warden repository:

Terminal window
# Pass --effort only when BENCH_EFFORT is set; leave it unset to measure the runtime default.
effort_args=()
if [ -n "${BENCH_EFFORT:-}" ]; then
  effort_args=(--effort "$BENCH_EFFORT")
fi

for target in "$BENCH_ROOT"/targets-*.txt; do
  # Recover the commit SHA from the targets-<sha>.txt file name.
  sha=${target##*/targets-}
  sha=${sha%.txt}
  short=${sha:0:8}
  git -C /tmp/sentry-benchmark checkout "$sha"
  pnpm cli -- run \
    -C /tmp/sentry-benchmark \
    @"$target" \
    --skill security-review \
    --config-path "$BENCH_ROOT/warden.toml" \
    --runtime "$BENCH_RUNTIME" \
    --model "$BENCH_MODEL" \
    "${effort_args[@]}" \
    --traces \
    --report-on low \
    --min-confidence low \
    --parallel 4 \
    -o "$BENCH_ROOT/sentry-security-review-${BENCH_RUN_SLUG}-corpus-${short}.jsonl" \
    -v \
    --log
done

Inspect the stitched output:

Terminal window
pnpm cli -- runs show "$BENCH_ROOT"/sentry-security-review-"$BENCH_RUN_SLUG"-corpus-*.jsonl \
  -C /tmp/sentry-benchmark \
  --report-on low \
  --min-confidence low

Keep every raw JSONL shard with the result summary, but do not commit raw JSONL until it has been reviewed for sensitive data. The raw artifacts are the source of truth for cost, duration, token counts, and future rescoring.

Store results in packages/docs/src/data/benchmarking/results/.

Record:

  • stable runId
  • corpus ID
  • repository
  • Warden version
  • skill
  • model
  • runtime
  • effort level, or provider-default
  • whether Warden’s post-analysis finding verifier was enabled
  • whether --traces was enabled, plus any run-level trace IDs preserved in the JSONL metadata
  • report and confidence thresholds
  • one shard per corpus commit, including SHA, target list, raw JSONL artifact name, and raw artifact review status
  • total files, chunks, failed chunks, findings, cost, duration, and tokens
  • timing.analysisChunkMs from top-level per-record durationMs values in the raw JSONL artifacts, when all raw shard artifacts are available
  • total wall duration from the stitched run summary
  • scoring summary once a reviewer matches findings back to the corpus
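A completeness check before saving a result can catch fields that were skipped. The sketch below is illustrative only: runId, findingVerification.enabled, and timing.analysisChunkMs are named on this page, while every other key is a hypothetical placeholder and not Warden's result schema. The scoring summary is left out because it is added later, after review.

```javascript
// Dotted paths for the fields in the list above.
// Keys marked "hypothetical" are placeholders, not a confirmed schema.
const REQUIRED_FIELDS = [
  "runId",
  "corpusId", // hypothetical
  "repository", // hypothetical
  "wardenVersion", // hypothetical
  "skill", // hypothetical
  "model", // hypothetical
  "runtime", // hypothetical
  "effort", // hypothetical; an effort level or "provider-default"
  "findingVerification.enabled",
  "traces.enabled", // hypothetical
  "thresholds.reportOn", // hypothetical
  "thresholds.minConfidence", // hypothetical
  "shards", // hypothetical
  "totals", // hypothetical
  "timing.analysisChunkMs",
];

// Follow a dotted path through nested objects; undefined if any step is absent.
function readPath(obj, dotted) {
  return dotted
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// A field counts as present when it is neither undefined nor null,
// so an explicit `false` (verifier off) is still a recorded value.
function missingFields(result) {
  return REQUIRED_FIELDS.filter((field) => readPath(result, field) == null);
}

// Demo with an incomplete draft; all values are invented.
const draft = {
  runId: "example-run",
  model: "openai/gpt-5.5",
  runtime: "pi",
  findingVerification: {enabled: false},
};
console.log(missingFields(draft));
```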

Warden’s finding verifier is enabled by default. Benchmark runs should leave it enabled unless they are deliberately testing verifier-off behavior. It is disabled only when defaults.verification.enabled = false is set in warden.toml. Record this as findingVerification.enabled in the result JSON.

Verifier calls are part of Warden’s analysis pipeline, not benchmark scoring. They can add provider cost, and runs with more candidate findings generally do more verifier work. Keep this separate from the benchmark scores field, which is the later semantic match against the corpus.

The timing breakdown has the same separation. Per-chunk P50 and P90 timing is recorded before Warden’s post-analysis verifier runs. Total timing includes post-analysis work and upstream provider latency, so treat it as noisy operational context rather than a stable comparison metric.
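As a rough illustration of the per-chunk numbers, P50 and P90 can be derived from the top-level durationMs values in the raw JSONL. This sketch makes two assumptions: it uses the nearest-rank percentile method, which may differ from how Warden computes its own figures, and it treats every record with a numeric top-level durationMs as one analysis chunk. The sample records are invented.

```javascript
// Nearest-rank percentile over a list of numbers (an assumed method).
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Collect top-level durationMs values from JSONL text, skipping blank lines
// and records that have no numeric durationMs.
function chunkDurations(jsonlText) {
  const durations = [];
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue;
    const record = JSON.parse(line);
    if (typeof record.durationMs === "number") durations.push(record.durationMs);
  }
  return durations;
}

// Invented sample: four timed records and one record without a duration.
const sample = [
  '{"durationMs": 1200}',
  '{"durationMs": 800}',
  '{"durationMs": 3000}',
  '{"durationMs": 1500}',
  '{"type": "summary"}',
].join("\n");
const durations = chunkDurations(sample);
console.log({p50: percentile(durations, 50), p90: percentile(durations, 90)});
// → { p50: 1200, p90: 3000 }
```

With only a handful of chunks, nearest-rank P90 collapses to the slowest chunk, which is one more reason to compare runs shard by shard rather than on a single percentile.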

Score by agent-verified semantic match, not exact wording or line number. A result counts as found when it identifies the same bug in roughly the same location as an existing corpus finding.

Scoring is not deterministic. An agent reviews every emitted finding against the existing corpus findings for that commit. Same-file findings about different bugs do not count. One emitted finding may count for multiple corpus entries when it clearly covers the same bug represented by multiple existing entries. Duplicate emitted findings do not double-count the same corpus entry.

Use this scoring checklist:

  • Read every emitted finding from the raw JSONL shards for the run.
  • For each finding, compare it against corpus entries from the same commit.
  • Use same path and nearby line range as the first candidate filter, but make the final decision semantically.
  • Count known-found only when the finding would lead a reviewer to the same bug in roughly the same code location.
  • Mark same-file findings about different bugs as not-known.
  • Record one scores entry for every emitted finding, including non-matches.
  • Set scoring.knownFound to the number of unique matched corpus entries, not the number of emitted findings.
  • Leave the run unscored when the raw findings are missing or cannot be semantically verified.
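The first-pass filter in the checklist (same commit, same path, nearby line range) can be sketched as below. It only narrows the candidates; the semantic decision still belongs to the reviewer. The corpus fields sha and code.path match the target-list script on this page, while code.startLine, code.endLine, the emitted-finding shape, and the 20-line slack are assumptions for illustration.

```javascript
// Return corpus entries worth comparing against one emitted finding.
// `slack` widens the line window on both sides; 20 is an arbitrary default.
function candidateEntries(finding, corpusEntries, slack = 20) {
  return corpusEntries.filter(
    (entry) =>
      entry.sha === finding.sha &&
      entry.code.path === finding.path &&
      finding.startLine <= entry.code.endLine + slack &&
      finding.endLine >= entry.code.startLine - slack,
  );
}

// Invented data: two entries in the same file and one from another commit.
const corpusEntries = [
  {id: "c1", sha: "abc", code: {path: "src/a.py", startLine: 40, endLine: 60}},
  {id: "c2", sha: "abc", code: {path: "src/a.py", startLine: 400, endLine: 420}},
  {id: "c3", sha: "def", code: {path: "src/a.py", startLine: 40, endLine: 60}},
];
const emitted = {sha: "abc", path: "src/a.py", startLine: 55, endLine: 58};

const candidates = candidateEntries(emitted, corpusEntries);
console.log(candidates.map((entry) => entry.id));
// → [ 'c1' ]
```

An empty candidate list does not prove a non-match, since a finding may report the same bug from a different line range, so the reviewer should still read the finding before marking it not-known.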

Keep the distinction clear:

  • known found: corpus vulnerabilities Warden found
  • total findings: all findings Warden emitted before scoring
  • unexpected valid: real vulnerabilities not already in the corpus
  • false positives: findings rejected by review
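The four counts can be tallied from the per-finding scores entries once review is complete. In this sketch the entry shape and the verdict strings "unexpected-valid" and "false-positive" are hypothetical; only "known-found" and the rule that knownFound counts unique corpus entries come from this page.

```javascript
// Tally reviewed scores entries. Each entry is assumed to look like
// {findingId, verdict, corpusIds}, where corpusIds lists matched corpus entries.
function tallyScores(scores) {
  const matched = new Set();
  const tally = {
    totalFindings: scores.length,
    knownFound: 0,
    unexpectedValid: 0,
    falsePositives: 0,
  };
  for (const score of scores) {
    if (score.verdict === "known-found") {
      // A set keeps duplicate matches from double-counting a corpus entry.
      for (const id of score.corpusIds) matched.add(id);
    } else if (score.verdict === "unexpected-valid") {
      tally.unexpectedValid += 1;
    } else if (score.verdict === "false-positive") {
      tally.falsePositives += 1;
    }
  }
  tally.knownFound = matched.size;
  return tally;
}

// Invented review: f1 covers two corpus entries, f2 repeats one of them.
const reviewed = [
  {findingId: "f1", verdict: "known-found", corpusIds: ["c1", "c2"]},
  {findingId: "f2", verdict: "known-found", corpusIds: ["c1"]},
  {findingId: "f3", verdict: "unexpected-valid", corpusIds: []},
  {findingId: "f4", verdict: "false-positive", corpusIds: []},
  {findingId: "f5", verdict: "false-positive", corpusIds: []},
];
console.log(tallyScores(reviewed));
// → { totalFindings: 5, knownFound: 2, unexpectedValid: 1, falsePositives: 2 }
```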

Do not treat the score as a universal pass rate. It is a relative comparison for this corpus and this run shape.