The case for a noise filter, in three parts.
The training incentive for Claude / GPT / Gemini-class reviewers favors thoroughness. That's correct in the abstract — better to over-flag than miss a real bug. In practice, it produces output where every PR review has eight items and seven are conditional ("if name is user-controlled, this could be XSS").
Developers compensate by skimming. Skimming has a known failure mode: the one real critical issue gets dismissed alongside the seven speculative ones. The output becomes uniformly low-trust.
This is a known industry-level problem. The Hacker News thread "There is an AI code review bubble" has the discussion. Anthropic acknowledged in April 2026 that a verbosity-constraint change "caused a 3% drop in coding quality evaluations." The Register's April 23 2026 piece on Claude Opus 4.7 false-positive AUP refusals covered the same shape from a different angle.
You can't fix the underlying training incentive — that's Anthropic / OpenAI / Google's problem and they're working on it. What you can do is intercept the output and apply a deterministic filter before a human ever sees it.
Brass's filter is small and explainable: a confidence threshold per finding type, a style-issue denylist (Pylint C0301 et al. drop), per-file caps so no single file dominates, and a hard "CRITICAL severity always survives" rule. The math fits in a 60-line Python module. There's no LLM in the loop; the reasoning is auditable.
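Here's a minimal sketch of that shape. The threshold values, the Finding fields, and the denylisted codes below are illustrative stand-ins, not Brass's shipped defaults:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Finding:
    # Hypothetical shape; the real Brass module's fields may differ.
    file: str
    rule: str          # scanner rule ID, e.g. "C0301" or "B608"
    type: str          # "security", "complexity", "style", ...
    severity: str      # "LOW" | "MEDIUM" | "HIGH" | "CRITICAL"
    confidence: float  # scanner-reported confidence, 0.0 to 1.0

# Illustrative numbers, not Brass's shipped defaults.
CONFIDENCE_FLOOR = {"security": 0.5, "complexity": 0.7, "style": 0.9}
STYLE_DENYLIST = {"C0301", "C0114", "C0115"}  # known-noise Pylint codes
PER_FILE_CAP = 5

def filter_findings(findings: list[Finding]) -> list[Finding]:
    survivors: list[Finding] = []
    per_file: dict[str, int] = defaultdict(int)
    # Highest confidence first, so the per-file cap keeps the best findings.
    for f in sorted(findings, key=lambda x: x.confidence, reverse=True):
        if f.severity == "CRITICAL":
            survivors.append(f)   # hard rule: CRITICAL always survives
            continue
        if f.rule in STYLE_DENYLIST:
            continue              # denylisted style noise drops
        if f.confidence < CONFIDENCE_FLOOR.get(f.type, 0.8):
            continue              # below the per-type confidence threshold
        if per_file[f.file] >= PER_FILE_CAP:
            continue              # no single file dominates the report
        per_file[f.file] += 1
        survivors.append(f)
    return survivors
```

The entire decision trail is a handful of comparisons, which is what makes it auditable.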
That filter then runs against two inputs: (a) Brass's own scanners (detect-secrets, bandit, pylint, radon, plus AI-coder anti-pattern AST analysis), and (b) any third-party AI reviewer's JSON output, via brassai filter. Either way, the artifact you read is already triaged.
Brass doesn't replace Claude Code's built-in review; Brass filters its output. brassai filter takes Claude Code's review JSON in and emits the survivors. Use both.
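If you'd rather do that step as a library call than a CLI invocation, the adapter is a few lines. This continues the filter_findings sketch above; the JSON field names are assumptions, not a documented Claude Code schema:

```python
import json
import sys

# Read a JSON array of reviewer findings from stdin, emit only the survivors.
# Field names here are guesses; map them to your reviewer's actual schema.
findings = [
    Finding(
        file=item.get("path", "<unknown>"),
        rule=item.get("rule", ""),
        type=item.get("category", "style"),
        severity=str(item.get("severity", "LOW")).upper(),
        confidence=float(item.get("confidence", 0.0)),
    )
    for item in json.load(sys.stdin)
]
json.dump([vars(f) for f in filter_findings(findings)], sys.stdout, indent=2)
```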
claude-mem for memory between sessions? claude-mem is excellent and we recommend it. Different problem: claude-mem remembers what was said; Brass filters what was found. They compose well; the memory layer doesn't conflict with the review filter.
Other scanners and AI reviewers generate findings. Brass triages findings. The problem isn't "we need more findings"; it's "we need fewer findings that are higher signal." Brass is the layer that makes any of the others useful instead of overwhelming.
A CLI runs in CI, in pre-commit hooks, in editor-on-save, on a server. IDE plugins run in one IDE. Composability wins.
Brass scans private source code. We don't want to be a SOC2 audit target; you don't want to upload your code to a vendor for static analysis. Run it on your machine; output stays on your disk.
We don't ship vague claims. The benchmark script clones nine pinned public Python projects, runs Brass on each, and reports finding counts and runtimes — reproducible on any machine.
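A harness with that shape fits in a screen of Python. In this sketch the repo list, the commit pins, and the brassai invocation are all placeholders; the real script pins nine specific projects, and you should substitute its actual scan command:

```python
import json
import subprocess
import tempfile
import time
from pathlib import Path

# Placeholders: the real benchmark pins nine specific projects and commits.
REPOS = {
    "example-a": ("https://github.com/org/example-a", "deadbeef"),
    "example-b": ("https://github.com/org/example-b", "cafef00d"),
}
# Placeholder invocation; substitute Brass's real scan command.
BRASS_CMD = ["brassai", "scan", "--json"]

for name, (url, commit) in REPOS.items():
    workdir = Path(tempfile.mkdtemp(prefix=f"brass-bench-{name}-"))
    subprocess.run(["git", "clone", "--quiet", url, str(workdir)], check=True)
    subprocess.run(["git", "-C", str(workdir), "checkout", "--quiet", commit],
                   check=True)
    start = time.monotonic()
    result = subprocess.run([*BRASS_CMD, str(workdir)],
                            capture_output=True, text=True)
    elapsed = time.monotonic() - start
    count = len(json.loads(result.stdout or "[]"))
    print(f"{name}: {count} findings in {elapsed:.1f}s")
```

Pinning commits is what makes the counts comparable across machines: same inputs, same scanners, same numbers.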