The Sentry benchmark is a small, qualitative readout for Warden’s security
review behavior. It compares runs against known vulnerabilities from the public
getsentry/sentry repository.
This is not an exhaustive eval and it is not a proof that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.
The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.
That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.
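As a rough illustration of how a run stays focused, the sketch below groups corpus entries by commit so that each checkout scans only the files with known vulnerabilities. The entry shape, the field names `sha` and `file`, and the helper `build_scan_plan` are assumptions made for this example, not Warden's actual corpus schema.

```python
from collections import defaultdict

# Hypothetical corpus entries; IDs, SHAs, and paths are made up for the sketch.
corpus = [
    {"id": "v1", "sha": "aaa111", "file": "src/api/a.py"},
    {"id": "v2", "sha": "aaa111", "file": "src/api/a.py"},
    {"id": "v3", "sha": "aaa111", "file": "src/api/b.py"},
    {"id": "v4", "sha": "bbb222", "file": "src/web/c.py"},
]


def build_scan_plan(entries):
    """Map each commit SHA to the sorted list of files to scan at that commit."""
    files_by_sha = defaultdict(set)
    for entry in entries:
        # A set collapses several vulnerabilities in one file into one scan target.
        files_by_sha[entry["sha"]].add(entry["file"])
    return {sha: sorted(files) for sha, files in files_by_sha.items()}


plan = build_scan_plan(corpus)
for sha, files in plan.items():
    print(sha, files)
```

Because several corpus entries can point at the same file, the number of scan targets can be smaller than the number of vulnerabilities, which is consistent with 86 entries spanning 79 files.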
The score table is the headline. The cost and timing tables below it are operational context for understanding why two runs with similar scores may look very different to operate. This matrix only shows stable comparison runs with per-chunk timing metadata and no failed chunks; older incomplete or partial runs remain in the result data but are hidden here.
Known found is the useful number. It counts corpus entries where an agent verified that Warden found the same bug in roughly the same location as an existing corpus finding. Exact wording, line numbers, and exploit framing can drift.
Scoring is a review judgment, not a deterministic formula. Same-file findings
about different bugs do not count. One emitted finding can count for more than
one corpus entry when it clearly covers multiple existing entries for the same
bug, and duplicate emitted findings do not double-count the same corpus entry.
Result JSON files with scores include the per-finding agent verification
records used for that row.
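The counting rules above can be sketched as a set union over verification records. This is only an illustration: the record shape, the `matched` field, and the `known_found` helper are hypothetical, and the semantic match itself is a review judgment that the code takes as given.

```python
# Hypothetical per-finding verification records. Each lists the corpus entry
# IDs that a reviewing agent judged the emitted finding to match (maybe none).
records = [
    {"finding": "f1", "matched": ["v1"]},
    {"finding": "f2", "matched": ["v1"]},        # duplicate of f1: no double count
    {"finding": "f3", "matched": ["v2", "v3"]},  # one finding covering two entries
    {"finding": "f4", "matched": []},            # same file, different bug
]


def known_found(verification_records):
    """Count distinct corpus entries matched by at least one emitted finding."""
    matched_entries = set()
    for record in verification_records:
        matched_entries.update(record["matched"])
    return len(matched_entries)


found = known_found(records)
total_findings = len(records)
print(f"known found: {found}, total findings: {total_findings}")
```

Using a set keeps duplicate findings from inflating the score, while still letting a single finding credit every corpus entry it clearly covers.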
Benchmark runs use Warden’s post-analysis finding verifier unless the run explicitly opts out. That verifier is separate from benchmark scoring: it runs during Warden analysis to filter candidate findings, while scoring later checks whether each emitted finding semantically matches an existing corpus entry. Verifier calls add provider cost, and runs that produce more findings generally cost more because there is more verifier work to do.
Total findings is the amount of review output Warden produced before scoring. A higher number can be good if it finds more real vulnerabilities, but it also means more human review.
Recorded cost is the provider-reported cost persisted in the result metadata,
not cost per finding. The cost table shows one recorded-cost column plus the
persisted input and output token totals. Recorded cost can include Warden’s
post-analysis verifier and other auxiliary model calls, but those calls may use
auxiliary or synthesis models rather than the model being benchmarked. Because
of that, auxiliary cost is operational context, not a useful comparison
dimension for this matrix. The raw JSONL logs are kept outside the docs until
they have been reviewed for sensitive data. Current displayed runs scan the
same 156 analysis chunks with zero failed chunks. Rows with failed chunks stay
out of the stable matrix until they are rerun or explicitly recorded as partial.
When the raw artifacts preserve verifier usage, it is included under
auxiliaryUsage.verification; some run shapes only persist the final total or
per-chunk analysis usage.
The Opus 4.6 high-effort Pi run illustrates the gap. It completed with no failed chunks, but its live CLI shard summaries showed approximately $38.43 total, including auxiliary and post-processing work, while the persisted JSONL artifacts contain $30.89 of scan cost. Until the artifact format preserves that auxiliary usage exactly, the table uses the persisted JSONL cost and the run note records the gap. Treat recorded cost as operational accounting for a row, not normalized model pricing.
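For a sense of scale, the arithmetic behind that gap can be worked through directly from the two figures quoted above. The variable names are illustrative, and because the live total is approximate, the result is an estimate rather than an exact accounting.

```python
from decimal import Decimal

# Figures quoted in the text for the Opus 4.6 high-effort Pi run.
live_total = Decimal("38.43")      # approximate total from live CLI shard summaries
persisted_scan = Decimal("30.89")  # scan cost in the persisted JSONL artifacts

# Cost that the persisted artifacts do not account for.
gap = live_total - persisted_scan
share = float(gap / live_total)

print(f"unpersisted cost: ~${gap} ({share:.1%} of the live total)")
```

Under these figures, roughly a fifth of the live total is missing from the persisted artifacts, which is why the table value should not be read as the full cost of the run.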
Treat duration as an operational measurement, not a stable model quality
metric. P50 and P90 come from per-analysis-chunk durationMs records in the
raw JSONL artifacts when those artifacts are available. Total is included as
operational context, but it is noisy: it includes post-analysis work such as
finding verification, and it is affected by upstream provider latency,
queueing, retries, and transient service reliability. That matters most when
comparing the same model across different runtimes.
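A minimal sketch of deriving P50 and P90 from per-chunk records follows. Only the `durationMs` field name comes from the text above; the sample records, the `chunk` field, and the nearest-rank percentile method are assumptions, and Warden's own tooling may compute percentiles differently.

```python
import json
import math

# Hypothetical JSONL content with one record per analysis chunk.
raw_jsonl = """\
{"chunk": 1, "durationMs": 1200}
{"chunk": 2, "durationMs": 800}
{"chunk": 3, "durationMs": 4500}
{"chunk": 4, "durationMs": 1500}
{"chunk": 5, "durationMs": 30000}
"""


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


durations = [
    json.loads(line)["durationMs"]
    for line in raw_jsonl.splitlines()
    if line.strip()
]
p50 = percentile(durations, 50)
p90 = percentile(durations, 90)
print(f"P50={p50}ms P90={p90}ms total={sum(durations)}ms")
```

In this made-up sample a single slow chunk dominates both P90 and the total while leaving P50 untouched, which is the kind of effect that makes the total a weak basis for comparing runs.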
Treat cost the same way. It is useful for operating Warden, but it is not a normalized model-efficiency metric. Provider defaults, runtime defaults, explicit reasoning effort, cache behavior, output verbosity, retries, runtime accounting, and Warden’s finding verifier can all move the total even when the corpus and target files are identical.
Pi runs without an explicit Warden --effort use Pi’s default thinking level,
which is currently medium.
The Opus 4.7 and 4.8 Pi runs currently have an unusual shape. They complete cleanly, but many no-finding chunks are very short. In the traced Opus 4.8 Pi rerun, chunks with findings averaged 4.3 turns and 3.1k output tokens, while no-finding chunks averaged 1.8 turns and 824 output tokens. Eighty-one of 137 no-finding chunks were one-turn scans, and 32 of 68 missed corpus entries were covered by one-turn no-finding scans. The Opus 4.7 run predates trace capture, but its low token use and short timings show a similar pattern.
Treat this as a hypothesis, not a model diagnosis. Through Pi, the traces suggest that many no-finding chunks terminate early. The miss pattern is consistent with under-exploration of cross-file authorization and data-boundary invariants, rather than a proven inability to reason about an issue once it has been found.
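The kind of trace summary behind those averages can be sketched as below. The trace records and the field names `turns`, `outputTokens`, and `findings` are hypothetical stand-ins chosen for the example; they are not the real trace format, and the numbers are not benchmark data.

```python
from statistics import mean

# Hypothetical per-chunk trace records.
traces = [
    {"turns": 5, "outputTokens": 3400, "findings": 2},
    {"turns": 3, "outputTokens": 2600, "findings": 1},
    {"turns": 1, "outputTokens": 300, "findings": 0},
    {"turns": 1, "outputTokens": 500, "findings": 0},
    {"turns": 4, "outputTokens": 1900, "findings": 0},
]


def summarize(chunks):
    """Average turns and output tokens for a group of chunks."""
    return {
        "chunks": len(chunks),
        "avg_turns": mean(c["turns"] for c in chunks),
        "avg_output_tokens": mean(c["outputTokens"] for c in chunks),
    }


with_findings = summarize([c for c in traces if c["findings"] > 0])
no_findings = summarize([c for c in traces if c["findings"] == 0])
# One-turn scans that emitted nothing are the early-termination signal.
one_turn_no_finding = sum(
    1 for c in traces if c["findings"] == 0 and c["turns"] == 1
)

print(with_findings)
print(no_findings)
print("one-turn no-finding chunks:", one_turn_no_finding)
```

Splitting the summary by whether a chunk produced findings is what exposes the pattern: a low average for no-finding chunks, driven by one-turn scans, points at early termination rather than at the quality of the findings that were emitted.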
In the current clean, agent-verified rows, GPT 5.5 on Pi found 41 of 86 with explicit high effort and 28 of 86 with explicit low effort. Sonnet 4.6 found 25 of 86 on Pi at Pi’s default level. Opus 4.6 on Pi with explicit high effort found 23 of 86. Opus 4.8 found 18 of 86 on Pi at Pi’s default level, 14 of 86 on Pi with explicit high effort, and 17 of 86 through the Claude SDK. Opus 4.7 on Pi found 6 of 86 at Pi’s default level.
A Claude SDK Sonnet 4.6 run found 23 of 86, but one chunk failed, so it is kept in the result data and omitted from the stable matrix until rerun cleanly.
Use those numbers as a relative comparison for this corpus. They are not a general pass rate for Sentry.
The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.
Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.