We Benchmarked BrassCoders Against a Frontier Model

Head-to-head on 12 AI-generated Python files: BrassCoders 11/12, Claude 12/12, Bandit 6/12, Pylint 1/12. What the numbers mean for your pre-merge workflow.

Copper Sun Brass Team · · 6 min read · Updated June 25, 2026
benchmarksai-code-reviewoss-core
We Benchmarked BrassCoders Against a Frontier Model

A frontier model asked to review AI-generated Python caught 12 of 12 planted bugs. BrassCoders, the scanner that catches what AI assistants structurally miss, caught 11. Bandit caught 6. Pylint caught 1. These are real numbers from a committed, reproducible benchmark run on June 13, 2026 against BrassCoders 2.0.8.

The number that matters isn’t 11 against 12. It’s the category breakdown, and the four categories where BrassCoders is the only tool that caught anything.

The Benchmark Setup

BrassCoders’s AI-coder-bug benchmark is an open dataset of 12 AI-generated Python files, each carrying one planted bug, plus two clean controls. Every file ships a PROVENANCE docstring naming the prompt that produced it. The corpus, manifest, and runner live in the OSS repo at github.com/CopperSunDev/brasscoders under cli/docs/benchmarks/ai-coder-bugs/, Apache 2.0 licensed and reproducible by anyone.

Scoring is per-category catch-rate. A tool catches a bug if it emits any finding for that file that maps to the bug’s category through pre-registered signal patterns in the runner. Those patterns were committed before any run took place, so they weren’t tuned to produce a flattering result. The runner is 600 lines. All of it open.

Four tools ran: BrassCoders 2.0.8, Bandit 1.8.3 (the version BrassCoders bundles internally), Pylint, and Claude sonnet-4-6, the last reached through the Anthropic API with a realistic pre-merge review prompt.

The Result That Matters: The Wedge Is Real

BrassCoders caught all four AI-coder performance anti-patterns, the category every other static analyzer scored zero on. Bandit caught none. Pylint caught none. Semgrep caught none.

The four bugs in the wedge category:

  • O(N²) string concatenation: a csv += loop that rebuilds the whole string on every row. Bandit 0, Pylint 0, ask-model 1, BrassCoders 1.
  • list.insert(0) in a loop: prepend-by-shifting the whole list on every insert. Bandit 0, Pylint 0, ask-model 1, BrassCoders 1.
  • Triple-nested loop for a join: three nested for loops over customers × orders × items where a dict-lookup would be O(N). Bandit 0, Pylint 0, ask-model 1, BrassCoders 1.
  • Unbounded while True: a socket-drain loop with no break, no timeout, no size cap. Bandit 0, Pylint 0, ask-model 1, BrassCoders 1.

The pattern holds. These are the bugs AI coding assistants introduce because the prompt described the happy path. The model wrote idiomatic-looking code that passes a small unit test. The bug surfaces only at volume. No generic security rule fires. No type error exists. A linter built around security rules and style checks has no mechanism to flag them.

BrassCoders carries four AST-level rules that match these anti-patterns directly: string concatenation in a loop, list.insert(0) in a loop, nesting deeper than a threshold, and while True with no exit. The rules are dumb. The detection is reliable.

What the Model Wins On

BrassCoders missed one bug, and Claude sonnet-4-6 caught it: sum(readings) / len(readings) with no empty-list guard, a ZeroDivisionError on [] that leaves no structural marker for a rule to match.

The model also caught every security category. SQL injection through an f-string. Command injection through subprocess shell=True. XSS through autoescape=False. Unsafe pickle deserialization. BrassCoders caught these too, but the model would have caught them without any rules, because it reasons about intent. That’s the difference.

The overall table, from real tool output on the committed corpus:

ToolCaughtCatch Rate
BrassCoders 2.0.811 / 1292%
Claude sonnet-4-612 / 12100%
Bandit 1.8.36 / 1250%
Pylint1 / 128%

Zero false positives on either clean control, across every tool.

The Generation-Mode Finding

BrassCoders also ran a generation-mode probe: six neutral coding tasks with no mention of bugs or performance, asking the model to write each from scratch, then scanning the output and asking the model to self-review.

The model wrote clean code on four of five wedge tasks. Claude sonnet-4-6 in June 2026 reaches for io.StringIO instead of string concatenation, guards the empty-list case on its own, and merges dicts with {**a, **b}. The model has improved since the first benchmark version ran in early June.

On the one task where the model did introduce a bug, an insert(0) loop on the recent-feed task, both BrassCoders and the model’s self-review caught it.

The more telling number: the model issued zero proactive warnings while writing the code. It wrote the insert(0) loop and moved on. It flagged the problem only when asked to review. A gate that needs an explicit invocation isn’t a gate.

The Axes BrassCoders Wins That Don’t Show in a Catch-Rate Table

BrassCoders ran the same 12 files three times and produced identical results every time. That repeatability is what makes it a gate, and it’s the axis a catch-rate table never measures.

Determinism. The model’s review of the same 12 files returns different output run to run, because LLMs sample non-deterministically. A CI gate that returns different verdicts on the same commit is noise, not signal.

Cost per run. The BrassCoders OSS core costs nothing to run. No API calls, no token charges. At 50 commits per developer per month and a 500-line average diff, a model review adds $0.15 to $0.25 per developer per month at current API pricing. The dollar figure isn’t the real point. A $0 scan is a scan nobody has to approve.

Bytes sent off-machine. Running brasscoders --offline scan sends zero bytes off the machine. A model review sends the full source to an external API. For any codebase with data-handling rules, only BrassCoders can run on every commit.

CI automatability. BrassCoders runs as a standard command in a GitHub Actions step. The model review runs when a developer remembers to ask. Both help. Only one runs automatically.

How to Reproduce

BrassCoders ships the full benchmark corpus in its OSS repo, so the static-tool numbers reproduce exactly. Clone the repo, install the tools, and run the runner with --no-ask-model:

git clone https://github.com/CopperSunDev/brasscoders
pip install brasscoders bandit pylint pyyaml
python cli/docs/benchmarks/ai-coder-bugs/run_benchmark.py --no-ask-model

The output should match the committed results/RESULTS.md within the margin of your installed BrassCoders version. To reproduce the ask-model column, set ANTHROPIC_API_KEY and drop --no-ask-model. Model results vary a little because LLM sampling is non-deterministic, but the overall catch rate has held steady across runs.

The benchmark is the dataset. The runner is the scoring logic. Both are open, Apache 2.0 licensed, and built to extend. Want a new bug category? Add the pattern to the manifest and commit its signal patterns to CATEGORY_SIGNALS before you run.

What the Numbers Say

BrassCoders isn’t the smarter reviewer. Claude sonnet-4-6 is smarter, and it caught the one bug that needs reasoning instead of rules. BrassCoders is the deterministic gate: the same scan and the same findings on every commit, in CI, for free, with nothing sent off-machine.

Those are the categories it wins. The benchmark shows both sides.

pip install brasscoders
brasscoders --offline scan /path/to/your/project

Frequently Asked Questions

Did BrassCoders beat the AI model?

No, and BrassCoders doesn't claim it did. Claude sonnet-4-6 caught 12 of 12 planted bugs; BrassCoders caught 11. The miss was an unguarded division, a len() call with no empty-list check, which defeated every static tool and would defeat almost any deterministic rule system. BrassCoders matches a frontier model on the categories where rules work, and loses on the one category where reasoning is the only path.

What's the wedge BrassCoders actually wins?

AI-coder performance anti-patterns. BrassCoders caught all four: O(N²) string concat, an insert-at-zero loop, a triple-nested join, and an unbounded while True. Bandit, Semgrep, and Pylint caught zero of four. These are the bugs AI coding assistants introduce when the prompt describes the happy path and not the bounds. The code looks correct, passes a unit test, and degrades silently at volume.

Why not just ask the model to review every PR?

The model reviews only when asked. In BrassCoders's generation-mode probe, the model introduced performance bugs it never warned about while writing the code, and caught zero of six in its own output without an explicit review prompt. A gate that runs automatically on every commit, deterministically, is a different tool from a model you have to remember to ask.

Is the benchmark reproducible?

Yes. The corpus, manifest, and runner are committed at github.com/CopperSunDev/brasscoders under cli/docs/benchmarks/ai-coder-bugs/. Install the three tools, then run python run_benchmark.py --no-ask-model. The static-tool results reproduce exactly. The ask-model results vary a little because LLMs are non-deterministic, but the overall catch rate has stayed stable across runs.

What does BrassCoders miss?

Pure logic bugs with no structural marker. The unguarded-division case, dividing by len() with no empty-list check, defeated every tool in the benchmark, including the model on some runs. BrassCoders also misses performance patterns outside its current rule set: an O(N²) membership test in a loop (x not in growing_list) and no-timeout polling, both of which the model caught on self-review but BrassCoders did not.