Overview

The Sentry benchmark is a small, qualitative readout for Warden’s security review behavior. It compares runs against known vulnerabilities from the public getsentry/sentry repository.

This is not an exhaustive eval and it is not a proof that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.

What It Is

The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.

That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.
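As a rough sketch of how that scoping could work, the snippet below groups corpus entries by commit and derives the file list to scan at each one. The field names, SHAs, and paths are placeholders invented for illustration; they are not Warden's actual corpus schema or real Sentry files.

```python
from collections import defaultdict

# Hypothetical corpus entries. Field names, SHAs, and paths are assumptions
# made for this example, not the real corpus format.
corpus = [
    {"sha": "aaa111", "file": "src/auth.py", "description": "example entry 1"},
    {"sha": "aaa111", "file": "src/auth.py", "description": "example entry 2"},
    {"sha": "aaa111", "file": "src/webhooks.py", "description": "example entry 3"},
    {"sha": "bbb222", "file": "src/api.py", "description": "example entry 4"},
]


def scan_targets(entries):
    """Map each commit SHA to the sorted, de-duplicated files to scan there."""
    files_by_sha = defaultdict(set)
    for entry in entries:
        files_by_sha[entry["sha"]].add(entry["file"])
    return {sha: sorted(files) for sha, files in files_by_sha.items()}


targets = scan_targets(corpus)
for sha, files in targets.items():
    print(sha, files)
```

Two entries in the same file at the same commit produce one scan target, which is why the corpus can hold more vulnerabilities than files.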

Comparison Matrix

The score table is the headline and sorts by known-corpus recall. The cost table ranks runs by recorded provider cost per known corpus finding. This matrix includes only complete runs: every chunk finished without failure and recorded per-chunk trace data.

Timing remains in result metadata for diagnostics, but the overview does not rank it. Provider load, network conditions, retries, queueing, and the benchmark host can materially change wall time. Recall and cost are the stable comparison signals.
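The inclusion rule and the two sort orders can be sketched as follows. The `Run` fields `failed_chunks` and `has_traces` are hypothetical names, and the last row is an invented incomplete run that exists only to show the filter. The other three rows reuse figures from the matrix.

```python
from dataclasses import dataclass

CORPUS_SIZE = 86


@dataclass
class Run:
    name: str
    known: int          # unique known corpus entries matched
    cost_usd: float     # recorded provider cost
    failed_chunks: int  # hypothetical field name
    has_traces: bool    # hypothetical field name


runs = [
    Run("Grok 4.5 (Pi) high", 33, 38.65, 0, True),
    Run("GPT 5.5 (Pi) high", 41, 148.63, 0, True),
    Run("GLM 5.2 (Pi) high", 15, 5.26, 0, True),
    # Invented row: an incomplete run that the matrix would exclude.
    Run("hypothetical incomplete run", 50, 10.00, 3, True),
]


def is_complete(run):
    """A run qualifies only with no failed chunks and per-chunk traces."""
    return run.failed_chunks == 0 and run.has_traces


eligible = [r for r in runs if is_complete(r)]

# Score table: highest known-corpus recall first.
score_table = sorted(eligible, key=lambda r: r.known / CORPUS_SIZE, reverse=True)

# Cost table: lowest cost per known finding first (assumes known > 0).
cost_table = sorted(eligible, key=lambda r: r.cost_usd / r.known)
```

Timing is deliberately absent from both sort keys, matching the decision not to rank wall time.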

| Run | Known | Findings | Cost | Cost/known |
| --- | --- | --- | --- | --- |
| GPT 5.5 (Pi), high | 41/86 (47.7%) | 72 | $148.63 | $3.63 |
| GPT 5.6 Luna (Pi), high | 36/86 (41.9%) | 57 | $56.98 | $1.58 |
| Grok 4.5 (Pi), high | 33/86 (38.4%) | 41 | $38.65 | $1.17 |
| GPT 5.5 (Pi), low | 28/86 (32.6%) | 38 | $39.36 | $1.41 |
| Claude Sonnet 4.6 (Pi) | 25/86 (29.1%) | 32 | $19.84 | $0.79 |
| Claude Sonnet 4.6 (Claude SDK) | 24/86 (27.9%) | 32 | $103.59 | $4.32 |
| Claude Opus 4.6 (Pi), high | 23/86 (26.7%) | 24 | $36.86 | $1.60 |
| DeepSeek V4 Pro (Pi), xhigh | 23/86 (26.7%) | 30 | $18.70 | $0.81 |
| Claude Sonnet 5 (Pi) | 22/86 (25.6%) | 27 | $23.46 | $1.07 |
| Claude Opus 4.8 (Pi), high | 21/86 (24.4%) | 24 | $21.31 | $1.01 |
| Claude Opus 4.8 (Pi), medium | 18/86 (20.9%) | 19 | $14.50 | $0.81 |
| DeepSeek V4 Flash (Pi), xhigh | 18/86 (20.9%) | 27 | $10.11 | $0.56 |
| Claude Opus 4.8 (Claude SDK), high | 17/86 (19.8%) | 17 | $79.56 | $4.68 |
| GLM 5.2 (Pi), high | 15/86 (17.4%) | 18 | $5.26 | $0.35 |
| Claude Opus 4.7 (Pi), medium | 6/86 (7.0%) | 7 | $4.39 | $0.73 |

Cost

Lowest cost per known corpus finding first.

| Run | Cost/known | Total cost | Input tokens | Output tokens |
| --- | --- | --- | --- | --- |
| GLM 5.2 (Pi), high | $0.35 | $5.26 | 8.37M | 426.24k |
| DeepSeek V4 Flash (Pi), xhigh | $0.56 | $10.11 | 74.35M | 2.09M |
| Claude Opus 4.7 (Pi), medium | $0.73 | $4.39 | 1.53M | 20.77k |
| Claude Sonnet 4.6 (Pi) | $0.79 | $19.84 | 9.67M | 508.84k |
| Claude Opus 4.8 (Pi), medium | $0.81 | $14.50 | 4.62M | 225.33k |
| DeepSeek V4 Pro (Pi), xhigh | $0.81 | $18.70 | 65.51M | 1.85M |
| Claude Opus 4.8 (Pi), high | $1.01 | $21.31 | 6.52M | 376.36k |
| Claude Sonnet 5 (Pi) | $1.07 | $23.46 | 20.37M | 800.84k |
| Grok 4.5 (Pi), high | $1.17 | $38.65 | 28.81M | 1.66M |
| GPT 5.5 (Pi), low | $1.41 | $39.36 | 18.71M | 390.01k |
| GPT 5.6 Luna (Pi), high | $1.58 | $56.98 | 150.23M | 1.56M |
| Claude Opus 4.6 (Pi), high | $1.60 | $36.86 | 16.14M | 585.84k |
| GPT 5.5 (Pi), high | $3.63 | $148.63 | 127.9M | 986.84k |
| Claude Sonnet 4.6 (Claude SDK) | $4.32 | $103.59 | 65.67M | 1.09M |
| Claude Opus 4.8 (Claude SDK), high | $4.68 | $79.56 | 31.84M | 386.17k |

Reading Results

Known is the headline score: corpus entries where scoring confirmed the same bug in roughly the same location.
Findings is review volume before benchmark scoring. More findings can improve recall, but they also create more review work.
Cost is the recorded provider cost for the run.
Cost/known divides total cost by unique known corpus findings. Lower is more cost-efficient on this corpus. It does not include reviewer time.
Scoring is semantic. Same-file findings about different bugs do not count, duplicates do not double-count, and one finding can cover multiple corpus entries when it catches the same root bug.
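The counting rules above can be sketched as set arithmetic. This is a minimal illustration, assuming the semantic match step has already produced, for each emitted finding, the set of corpus entries it was judged to cover. The `score_run` helper, the finding ids, and the entry ids are hypothetical and do not describe Warden's scorer.

```python
def score_run(matches, corpus_size, total_cost):
    """Summarize a run.

    matches maps each emitted finding id to the set of corpus entry ids it
    was judged to match (an empty set when it describes a different bug).
    """
    known = set()
    for entry_ids in matches.values():
        known |= entry_ids  # set union: duplicates never double-count
    return {
        "known": len(known),
        "recall": len(known) / corpus_size,
        "findings": len(matches),
        "matched_findings": sum(1 for ids in matches.values() if ids),
        "cost_per_known": total_cost / len(known) if known else None,
    }


# Hypothetical match results for a ten-entry corpus.
matches = {
    "f1": {"c1", "c2"},  # one finding covering two corpus entries
    "f2": {"c3"},
    "f3": {"c3"},        # duplicate of f2: still one known entry
    "f4": set(),         # same-file finding about a different bug
}

result = score_run(matches, corpus_size=10, total_cost=6.00)
print(result)

# The same division reproduces a headline row: $148.63 over 41 known entries.
headline_cost_per_known = round(148.63 / 41, 2)
```

Because known entries are a set, the number of matched findings can be higher or lower than the known count, which is why the two are reported separately in the analysis.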

Analysis

GPT 5.6 Luna High

GPT 5.6 Luna high uses Pi directly as openai/gpt-5.6-luna. It found 36 of 86 known entries and emitted 57 findings. This puts it behind GPT 5.5 high and ahead of Grok 4.5 high on known-corpus recall. Thirty-three emitted findings matched 36 unique corpus entries. The other 24 findings described different bugs and did not count.

The run cost $56.98, or $1.58 per known corpus finding. All 156 final chunks completed with traces. One MS Teams chunk lost its terminal OpenAI stream event in the main shard. A parallelism-1 repair replaced that exact chunk, and the recorded cost includes both the main shard and the full repair run.

Grok 4.5 High

Grok 4.5 high uses Pi through OpenRouter as openrouter/x-ai/grok-4.5. It found 33 of 86 known entries and emitted 41 findings, placing it third on known-corpus recall behind GPT 5.5 high and GPT 5.6 Luna high. Thirty-two emitted findings matched 33 unique corpus entries because one Atlassian JWT finding covered two entries. The other nine findings did not count because they described different bugs in the same files.

The matched findings cover a broad range of security boundaries. Grok found cross-project and cross-organization authorization gaps, OAuth and JWT validation problems, unsigned or replayable webhooks, exposed credentials, and four client-side injection issues. It also caught narrower logic errors, including the account-merge expiry bypass, pinned-search ownership issue, and workflow detector disconnect authorization gap.

Grok cost $38.65, nearly the same as GPT 5.5 low at $39.36, while finding five more known entries. Post-processing and verification cost $11.77 because the run produced more candidate findings. All 156 chunks completed with traces and no repair runs.

The full-run timing is not comparable. macOS entered idle sleep during the fifth shard and stayed in sleep or dark wake through the final shard. The first four shards completed before sustained sleep and covered 90 chunks with a 62.3-second P50, 3.4-minute P90, and 5.4-minute maximum. Grok was not fast, but the recorded 217.5-minute total and 81.6-minute maximum chunk were inflated by the sleeping benchmark host. The public comparison does not rank timing.
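To illustrate why a single stalled chunk distorts the total and the maximum far more than the median, here is a small sketch using a nearest-rank percentile. The durations are invented for the example and are not taken from any recorded run. The percentile method is also an assumption, since the benchmark's own P50 and P90 calculation is not described here.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-chunk durations in seconds. The last value stands in for
# a chunk that sat idle while the benchmark host was asleep.
durations = [30.0, 45.0, 50.0, 62.0, 70.0, 90.0, 120.0, 180.0, 200.0, 4900.0]

p50 = percentile(durations, 50)
p90 = percentile(durations, 90)
longest = max(durations)
total_minutes = sum(durations) / 60

print(f"P50={p50}s P90={p90}s max={longest}s total={total_minutes:.1f}min")
```

In this toy data the outlier accounts for most of the total while P50 and P90 stay unchanged, which is the same pattern that makes the pre-sleep shard percentiles more informative than the full-run total.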

Sonnet 5 High on Pi

Sonnet 5 high found 22 of 86 known entries and emitted 27 findings. That makes it competitive, but not better than Sonnet 4.6 high on this corpus. It costs more than Sonnet 4.6 on Pi, emits fewer final findings, and trails Sonnet 4.6 by three known matches.

Sonnet 4.6: Claude SDK vs Pi

This is the clearest runtime comparison. Pi found 25 of 86 known entries. The Claude SDK found 24 of 86. Both emitted 32 findings.

The quality result is close. The operating profile is not. Claude SDK recorded $103.59 total cost, compared to $19.84 for Pi. The trace summaries point to larger repeated context in Claude SDK runs, not a matching gain in recall.

Opus 4.8 High: Claude SDK vs Pi

Pi found 21 of 86 known entries and emitted 24 findings. Claude SDK found 17 and emitted 17. Pi was also cheaper: $21.31 total versus $79.56.

The trace shape differs from Sonnet 4.6. Pi does more turns and more tool executions here, but each turn carries much less input context. Claude SDK’s extra cost is mostly context volume, not more tool fanout.
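The two runtime comparisons can be put side by side with simple ratios. The figures below are copied from the score and cost tables, and the dictionary layout is just a convenient shape for this example. Input tokens here are the run totals from the cost table, not per-turn context sizes.

```python
# Cost in USD, input tokens in millions, known = matched corpus entries.
# All numbers come from the score and cost tables in this document.
comparisons = {
    "Sonnet 4.6": {
        "sdk": {"cost": 103.59, "input_m": 65.67, "known": 24},
        "pi": {"cost": 19.84, "input_m": 9.67, "known": 25},
    },
    "Opus 4.8 high": {
        "sdk": {"cost": 79.56, "input_m": 31.84, "known": 17},
        "pi": {"cost": 21.31, "input_m": 6.52, "known": 21},
    },
}


def sdk_vs_pi(pair):
    """Return Claude SDK / Pi ratios and the known-entry difference (Pi - SDK)."""
    sdk, pi = pair["sdk"], pair["pi"]
    return {
        "cost_ratio": round(sdk["cost"] / pi["cost"], 2),
        "input_ratio": round(sdk["input_m"] / pi["input_m"], 2),
        "known_delta": pi["known"] - sdk["known"],
    }


summary = {name: sdk_vs_pi(pair) for name, pair in comparisons.items()}
for name, row in summary.items():
    print(name, row)
```

In both pairs the input-token ratio is at least as large as the cost ratio while Pi matches the same number of known entries or more, which is consistent with the trace summaries pointing at context volume.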

Opus 4.8 High vs Opus 4.6 High

The direct Pi comparison favors Opus 4.6 high on recall. Opus 4.6 found 23 of 86 known entries. Opus 4.8 found 21. Both emitted 24 findings.

Opus 4.8 is more selective under the current prompt and corpus. It exits more investigations earlier, which lowers cost and tool fanout, but it misses enough known vulnerabilities to trail Opus 4.6 here.

DeepSeek V4 XHigh

DeepSeek V4 Pro found 23 of 86 known entries and emitted 30 findings. V4 Flash found 18 and emitted 27.

Flash is cheaper because the model price is lower, not because it does less work. It used more turns, more tool executions, and more scan input tokens than V4 Pro. The result is not just a cheaper Opus-shaped run; it explores much more context and lands on a different set of known findings.

GLM 5.2 High

GLM 5.2 uses Pi through OpenRouter as openrouter/z-ai/glm-5.2 with explicit --effort high. OpenRouter reports high as the model’s default reasoning effort, with xhigh also available. The recorded row scans the same 156 chunks and leaves Warden’s finding verifier enabled. It found 15 of 86 known corpus entries and emitted 18 total findings.

The main result is lower recall, not noisy output. Fifteen of the 18 emitted findings matched known corpus entries. The three non-matches were same-file or nearby security findings that did not match the corpus issue: a LaunchDarkly timing-unsafe compare rather than the Statsig timestamp freshness bug, a Bitbucket forwarded-IP/signature bypass rather than invalid-signature HMAC logging, and a Sentry App issue-link SSRF rather than the event-scope corpus issue.

Operationally, GLM 5.2 exposed a Warden compatibility problem. Many clean no-finding chunks returned prose instead of the required {"findings":[]} JSON. Those records had traces, usage, and zero findings, but Warden marked them as extraction_no_findings_json. Four shards therefore use combined-clean artifacts: traced zero-finding extraction failures were normalized to empty ok chunks, and targeted repair records were used where reruns produced cleaner records. One large seer_rpc.py chunk also exceeded OpenRouter’s effective 1M-token context limit in the full shard; rerunning the failed target set with --parallel 1 removed the context failure.
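The failure mode and the normalization step might look roughly like this. It is a sketch, not Warden's parser: only the {"findings":[]} shape and the extraction_no_findings_json status name come from the description above, while the function names and the record fields `status`, `has_trace`, and `findings` are assumptions.

```python
import json

NO_JSON = "extraction_no_findings_json"  # status name quoted in the text above


def extract_findings(raw_output):
    """Return (status, findings) for one chunk's model output."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return NO_JSON, []  # prose such as "No issues found." lands here
    if isinstance(payload, dict) and isinstance(payload.get("findings"), list):
        return "ok", payload["findings"]
    return NO_JSON, []


def normalize_chunk(record):
    """Treat a traced extraction failure with zero findings as an empty ok chunk."""
    if record["status"] == NO_JSON and record["has_trace"] and not record["findings"]:
        return {**record, "status": "ok", "findings": []}
    return record


status_json, _ = extract_findings('{"findings":[]}')
status_prose, _ = extract_findings("No issues found in this file.")

# Hypothetical chunk records; field names are assumptions.
traced = normalize_chunk({"status": NO_JSON, "has_trace": True, "findings": []})
untraced = normalize_chunk({"status": NO_JSON, "has_trace": False, "findings": []})
```

Only traced records are normalized in this sketch, so a chunk that failed without a trace keeps its failure status and would still need a rerun.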

Recorded cost for the validated artifacts is $5.26: $4.94 scan cost plus $0.32 post-processing and verification overhead. That excludes the abandoned xhigh attempt and dirty failed rerun artifacts. GLM 5.2 used 8.3M input tokens and 422k output tokens across the validated row, with a 39.4-second P50 chunk time and a 6.6-minute P90. The row is useful, but the parser issue should be fixed before treating GLM 5.2 as a routine unattended benchmark target.

Corpus

The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.

Run It

Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.