Overview

The Sentry benchmark is a small, qualitative readout for Warden’s security review behavior. It compares runs against known vulnerabilities from the public getsentry/sentry repository.

This is not an exhaustive evaluation, and it does not prove that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.

The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.

That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.

The score table is the headline. The cost and timing tables below it are operational context for understanding why two runs with similar scores may look very different to operate. This matrix only shows stable comparison runs with per-chunk timing metadata and no failed chunks; older incomplete or partial runs remain in the result data but are hidden here.

| Run | Effort | Known corpus | Total findings | Recorded cost |
| --- | --- | --- | --- | --- |
| GPT 5.5 (Pi) | high | 41/86 (47.7%) | 72 | $148.63 |
| GPT 5.5 (Pi) | low | 28/86 (32.6%) | 38 | $39.36 |
| Claude Sonnet 4.6 (Pi) | medium | 25/86 (29.1%) | 32 | $19.84 |
| Claude Opus 4.6 (Pi) | high | 23/86 (26.7%) | 31 | $30.89 |
| Claude Opus 4.8 (Pi) | medium | 18/86 (20.9%) | 19 | $14.50 |
| Claude Opus 4.8 (Claude SDK) | default | 17/86 (19.8%) | 19 | $62.26 |
| Claude Opus 4.8 (Pi) | high | 14/86 (16.3%) | 16 | $17.58 |
| Claude Opus 4.7 (Pi) | medium | 6/86 (7.0%) | 7 | $4.39 |

Cost and Tokens

| Run | Effort | Recorded cost | Input tokens | Output tokens |
| --- | --- | --- | --- | --- |
| GPT 5.5 (Pi) | high | $148.63 | 127.9M | 986.84k |
| GPT 5.5 (Pi) | low | $39.36 | 18.71M | 390.01k |
| Claude Sonnet 4.6 (Pi) | medium | $19.84 | 9.67M | 508.84k |
| Claude Opus 4.6 (Pi) | high | $30.89 | 12.61M | 501.01k |
| Claude Opus 4.8 (Pi) | medium | $14.50 | 4.62M | 225.33k |
| Claude Opus 4.8 (Claude SDK) | default | $62.26 | 23.32M | 362.97k |
| Claude Opus 4.8 (Pi) | high | $17.58 | 5.13M | 318.05k |
| Claude Opus 4.7 (Pi) | medium | $4.39 | 1.53M | 20.77k |

Timing

| Run | Effort | P50 | P90 | Total |
| --- | --- | --- | --- | --- |
| GPT 5.5 (Pi) | high | 3.0m | 5.6m | 163.9m |
| GPT 5.5 (Pi) | low | 34.2s | 56.4s | 55.2m |
| Claude Sonnet 4.6 (Pi) | medium | 41.9s | 1.9m | 53.6m |
| Claude Opus 4.6 (Pi) | high | 52.0s | 2.7m | 75.5m |
| Claude Opus 4.8 (Pi) | medium | 11.9s | 51.7s | 42.4m |
| Claude Opus 4.8 (Claude SDK) | default | 22.2s | 1.2m | 45.0m |
| Claude Opus 4.8 (Pi) | high | 20.4s | 1.2m | 84.2m |
| Claude Opus 4.7 (Pi) | medium | 1.2s | 9.6s | 6.6m |

Known found, shown in the score table as Known corpus, is the number to compare. It counts corpus entries where an agent verified that Warden found the same bug in roughly the same location as an existing corpus finding. Exact wording, line numbers, and exploit framing can drift without affecting the match.

Scoring is a review judgment, not a deterministic formula. Same-file findings about different bugs do not count. One emitted finding can count for more than one corpus entry when it clearly covers multiple existing entries for the same bug, and duplicate emitted findings do not double-count the same corpus entry. Result JSON files with scores include the per-finding agent verification records used for that row.
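The counting rules above can be sketched in a few lines. This is a minimal illustration, not Warden's scoring code: the `Verification` record shape, its field names, and the corpus IDs are all hypothetical, and the semantic match itself is assumed to have been decided already by the reviewing agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verification:
    """One agent verification record (hypothetical shape, not Warden's schema)."""
    finding_id: str
    matched_corpus_ids: tuple[str, ...]  # empty when the finding matched nothing


def known_found(records: list[Verification], corpus_ids: set[str]) -> set[str]:
    """Return the distinct corpus entries credited to a run."""
    credited: set[str] = set()
    for record in records:
        # One finding may cover several corpus entries; using a set keeps
        # duplicate findings from double-counting the same entry.
        credited.update(cid for cid in record.matched_corpus_ids if cid in corpus_ids)
    return credited


corpus = {"C-001", "C-002", "C-003", "C-004"}  # made-up IDs for illustration
records = [
    Verification("f1", ("C-001", "C-002")),  # one finding covering two entries
    Verification("f2", ("C-001",)),          # duplicate finding: no extra credit
    Verification("f3", ()),                  # same file, different bug: no credit
]
hits = known_found(records, corpus)
print(f"{len(hits)}/{len(corpus)} = {len(hits) / len(corpus):.1%}")  # 2/4 = 50.0%
```

Counting distinct corpus IDs rather than findings is what lets one finding earn credit for several entries while repeated findings earn none.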

Benchmark runs use Warden’s post-analysis finding verifier unless the run explicitly opts out. That verifier is separate from benchmark scoring: it runs during Warden analysis to filter candidate findings, while scoring later checks whether each emitted finding semantically matches an existing corpus entry. Verifier calls add provider cost, and runs that produce more findings generally cost more because there is more verifier work to do.

Total findings is the amount of review output Warden produced before scoring. A higher number can be good if it finds more real vulnerabilities, but it also means more human review.

Recorded cost is the provider-reported cost persisted in the result metadata, not cost per finding. The cost table shows one recorded-cost column plus the persisted input and output token totals. Recorded cost can include Warden’s post-analysis verifier and other auxiliary model calls, but those calls may use auxiliary or synthesis models rather than the model being benchmarked. Because of that, auxiliary cost is operational context, not a useful comparison dimension for this matrix.

The raw JSONL logs are kept outside the docs until they have been reviewed for sensitive data. Current displayed runs scan the same 156 analysis chunks with zero failed chunks. Rows with failed chunks stay out of the stable matrix until they are rerun or explicitly recorded as partial.

When the raw artifacts preserve verifier usage, it is included under auxiliaryUsage.verification; some run shapes only persist the final total or per-chunk analysis usage.

The Opus 4.6 high-effort Pi run is one case where only per-chunk analysis usage was persisted. It completed with no failed chunks, but its live CLI shard summaries showed approximately $38.43 total, including auxiliary and post-processing work, while the persisted JSONL artifacts contain $30.89 of scan cost. Until the artifact format preserves that auxiliary usage exactly, the table uses the persisted JSONL cost and the run note records the gap. Treat recorded cost as operational accounting for a row, not normalized model pricing.
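For scale, the gap in that Opus 4.6 row works out as follows. The two dollar figures come from the paragraph above; the variable names are only illustrative, and the live total is itself approximate.

```python
from decimal import Decimal

live_cli_total = Decimal("38.43")  # approximate total from live CLI shard summaries
persisted_scan = Decimal("30.89")  # scan cost in the persisted JSONL artifacts

# Auxiliary and post-processing cost that the artifacts did not preserve.
gap = live_cli_total - persisted_scan
share = gap / live_cli_total

print(f"gap: ${gap} ({share:.1%} of the live total)")
```

The gap is about $7.54, roughly a fifth of the live total, which is why the row's recorded cost is better read as accounting than as a price comparison.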

Treat duration as an operational measurement, not a stable model quality metric. P50 and P90 come from per-analysis-chunk durationMs records in the raw JSONL artifacts when those artifacts are available. Total is included as operational context, but it is noisy: it includes post-analysis work such as finding verification, and it is affected by upstream provider latency, queueing, retries, and transient service reliability. That matters most when comparing the same model across different runtimes.
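A minimal sketch of how P50 and P90 could be derived from those records follows. It assumes one JSON object per line with a top-level `durationMs` field, and it uses the nearest-rank percentile definition; the source does not state the real artifact layout or which percentile method the benchmark uses, so both are assumptions. The sample durations are made up.

```python
import json
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no durations recorded")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def chunk_durations_ms(jsonl_text: str) -> list[float]:
    """Collect durationMs from every JSONL record that carries one."""
    durations = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "durationMs" in record:
            durations.append(float(record["durationMs"]))
    return durations


# Made-up per-chunk durations in milliseconds, for illustration only.
sample = "\n".join(
    json.dumps({"durationMs": ms})
    for ms in [1200, 900, 15000, 2000, 1100, 9600, 1300, 1000, 1250, 40000]
)
durations = chunk_durations_ms(sample)
p50 = percentile(durations, 50)
p90 = percentile(durations, 90)
print(f"P50 {p50 / 1000:.1f}s, P90 {p90 / 1000:.1f}s")
```

A few slow chunks move P90 far more than P50, which is one reason the two columns can diverge sharply within a single row.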

Treat cost the same way. It is useful for operating Warden, but it is not a normalized model-efficiency metric. Provider defaults, runtime defaults, explicit reasoning effort, cache behavior, output verbosity, retries, runtime accounting, and Warden’s finding verifier can all move the total even when the corpus and target files are identical.

Pi runs that do not pass an explicit Warden --effort value use Pi’s default thinking level, which is currently medium.

The Opus 4.7 and 4.8 Pi runs currently have an unusual shape. They complete cleanly, but many no-finding chunks are very short. In the traced Opus 4.8 Pi rerun, chunks with findings averaged 4.3 turns and 3.1k output tokens, while no-finding chunks averaged 1.8 turns and 824 output tokens. Eighty-one of 137 no-finding chunks were one-turn scans, and 32 of 68 missed corpus entries were covered by one-turn no-finding scans. The Opus 4.7 run predates trace capture, but its low token use and short timings show a similar pattern.
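The kind of trace summary quoted above can be sketched as below. The `ChunkTrace` shape and the toy traces are hypothetical and are not Warden's trace schema; only the four counts in the last two lines come from the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ChunkTrace:
    """Per-chunk trace summary (hypothetical shape, for illustration)."""
    turns: int
    output_tokens: int
    findings: int


def summarize(traces: list[ChunkTrace]) -> dict[str, float]:
    """Compare chunks with findings to no-finding chunks.

    Assumes both groups are non-empty; mean() raises on an empty group.
    """
    with_findings = [t for t in traces if t.findings > 0]
    no_findings = [t for t in traces if t.findings == 0]
    one_turn = [t for t in no_findings if t.turns == 1]
    return {
        "avg_turns_with_findings": mean(t.turns for t in with_findings),
        "avg_turns_no_findings": mean(t.turns for t in no_findings),
        "one_turn_share": len(one_turn) / len(no_findings),
    }


# Toy traces, not real benchmark data.
toy = [
    ChunkTrace(4, 3000, 1), ChunkTrace(5, 3200, 2),
    ChunkTrace(1, 500, 0), ChunkTrace(1, 700, 0),
    ChunkTrace(3, 1200, 0), ChunkTrace(2, 900, 0),
]
summary = summarize(toy)

# Shares implied by the counts reported for the traced Opus 4.8 Pi rerun.
one_turn_no_finding_share = 81 / 137  # about 59.1% of no-finding chunks
missed_in_one_turn_share = 32 / 68    # about 47.1% of missed corpus entries
```

Expressed as shares, roughly three in five no-finding chunks ended after one turn, and just under half of the missed corpus entries sat in such chunks.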

Treat this as a hypothesis, not a model diagnosis. Through Pi, the traces suggest many no-finding chunks terminate early. The miss pattern is consistent with under-exploration of cross-file authorization and data-boundary invariants, not a proven inability to reason about a found issue.

In the current clean, agent-verified rows, GPT 5.5 on Pi found 41 of 86 with explicit high effort and 28 of 86 with explicit low effort. Sonnet 4.6 found 25 of 86 on Pi at Pi’s default level (medium). Opus 4.6 on Pi with explicit high effort found 23 of 86. Opus 4.8 found 18 of 86 on Pi at Pi’s default level, 14 of 86 on Pi with explicit high effort, and 17 of 86 through the Claude SDK. Opus 4.7 on Pi found 6 of 86 at Pi’s default level.

A Claude SDK Sonnet 4.6 run found 23 of 86, but one chunk failed, so it is kept in the result data and omitted from the stable matrix until rerun cleanly.
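The stable-matrix rule (per-chunk timing metadata present, no failed chunks) amounts to a simple filter. The sketch below uses hypothetical field names rather than the real result JSON schema, and the timing flag on the SDK run is assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Minimal run metadata (hypothetical field names)."""
    label: str
    known_found: int
    failed_chunks: int
    has_chunk_timing: bool


def is_stable(run: RunResult) -> bool:
    """A run is shown in the matrix only if it is complete and timed."""
    return run.failed_chunks == 0 and run.has_chunk_timing


runs = [
    RunResult("Claude Sonnet 4.6 (Pi), medium", 25, 0, True),
    # One failed chunk keeps this run in the result data but out of the matrix.
    # The timing flag here is an assumption, not a recorded fact.
    RunResult("Claude Sonnet 4.6 (Claude SDK)", 23, 1, True),
]
stable = [run.label for run in runs if is_stable(run)]
print(stable)
```

Filtering at display time, rather than deleting partial runs, matches the text: the omitted run stays available for a clean rerun.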

Use those numbers as a relative comparison for this corpus. They are not a general pass rate for Sentry.

The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.

Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.