brass benchmark results
Published from docs/benchmarks/public/render_public.py. Last refreshed 2026-05-18. Each linked project has a per-project page with reproducible scan instructions.
brass is a Python CLI that statically analyzes codebases for security vulnerabilities, AI-introduced anti-patterns, and code-quality issues. These benchmarks show what brass actually produces on a curated set of known-vulnerable and well-maintained third-party projects, each pinned at a specific commit so any reader can reproduce the results.
Track A — documented-vulnerability detection
Intentionally-vulnerable training corpora maintained by security projects. Each entry has a manifest of specific lines brass catches (required) and gaps brass doesn’t yet catch (aspirational).
| Project | Required findings | Aspirational gaps | Categories |
|---|---|---|---|
| OWASP PyGoat | 7 | 4 | command_injection, hardcoded_credential, weak_crypto |
| OWASP NodeGoat | 8 | 4 | code_injection, hardcoded_credential |
| PyCQA Bandit examples/ | 23 | 0 | assert_in_production, code_injection, command_injection, deserialization, hardcoded_credential, insecure_authentication, insecure_binding, insecure_permissions, insecure_tempfile, insecure_transport, path_traversal, security_misconfiguration, sql_injection, weak_crypto, xss |
| Yelp detect-secrets test_data/ | 6 | 3 | hardcoded_credential |
| Snyk Goof | 4 | 3 | hardcoded_credential |
Track B — output on real-world code
Mature, professionally-maintained open-source projects. NOT intentionally vulnerable. These show what brass produces on codebases shaped like a typical customer's, and they are used for noise-floor regression detection. Finding counts should stay stable across brass releases (a ±20% tolerance on findings is checked in CI).
| Project | Total findings | Critical | Wall time (s) | Top scanner |
|---|---|---|---|---|
| pallets/flask | 370 | 50 | 161.42 | PhantomAICodeScanner |
| tiangolo/fastapi | 848 | 50 | 584.45 | PhantomAICodeScanner |
| django/django | 1608 | 50 | 1333.35 | PhantomAICodeScanner |
| vercel/commerce | 0 | 0 | 3.6 | ? |
| vercel/turborepo | 210 | 50 | 131.03 | Brass2PrivacyScanner |
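The ±20% findings tolerance can be illustrated with a small check. This is a minimal sketch, not the CI job itself: the function names, the choice to compare total findings per project, and the handling of a zero baseline are all assumptions made for illustration.

```python
# Total findings per project, copied from the Track B table.
BASELINE_FINDINGS = {
    "pallets/flask": 370,
    "tiangolo/fastapi": 848,
    "django/django": 1608,
    "vercel/commerce": 0,
    "vercel/turborepo": 210,
}


def within_tolerance(baseline, current, tolerance=0.20):
    """Return True if `current` is within +/- `tolerance` of `baseline`."""
    if baseline == 0:
        # Assumption: a zero baseline only tolerates zero findings,
        # because a relative tolerance is undefined there.
        return current == 0
    return abs(current - baseline) <= tolerance * baseline


def regressions(current_findings, baseline=BASELINE_FINDINGS, tolerance=0.20):
    """List the projects whose finding count drifted outside the tolerance."""
    return sorted(
        project
        for project, base in baseline.items()
        if project in current_findings
        and not within_tolerance(base, current_findings[project], tolerance)
    )
```

For example, a hypothetical re-scan reporting 500 findings for pallets/flask would be flagged, while 1600 findings for django/django would pass.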
Methodology
brass version: each per-project page records the exact brass commit SHA that produced its numbers. Re-running brass at that SHA on the pinned upstream commit reproduces the metrics.
Scanners that ran: PhantomAICodeScanner, Brass2PrivacyScanner, auth_pattern_analyzer, SecretsScanner, AIContextCoherenceScanner, JavaScriptTypeScriptScanner, BrassPerformanceScanner, ContentModerationScanner, input_validation_analyzer, AstGrepScanner, PysaTaintScanner, bandit. Some scanners depend on external binaries (bandit, ast-grep, pyre, semgrep, node); the CI environment has all of them installed. Customer environments may produce different numbers if any of these are missing.
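Because missing binaries can change the numbers, it may help to check for them before comparing a local run against this page. The sketch below uses only the Python standard library. The executable names are taken from the list above and are assumed to match what is on PATH; some installs expose a tool under a different name. This is not a brass feature.

```python
import shutil

# Names from the scanner dependency list above. The actual executable
# name on PATH may differ per install (assumption).
EXTERNAL_BINARIES = ["bandit", "ast-grep", "pyre", "semgrep", "node"]


def missing_binaries(names, which=shutil.which):
    """Return the names that cannot be resolved on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries(EXTERNAL_BINARIES)
    if missing:
        print("Missing external binaries:", ", ".join(missing))
    else:
        print("All external binaries found.")
```

The `which` parameter exists only so the lookup can be replaced in tests.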
Enrichment: published numbers come from brasscoders --offline scan
(no AI enrichment). The enriched output Paid-plan customers see has
the same underlying findings but with AI-mediated clustering,
re-ranking, and contextual rationale per finding.
True-positive rate: Track A’s required_findings count IS the
true-positive surface for documented vulnerabilities in each corpus.
Aspirational entries are KNOWN gaps — published honestly so customers
can judge brass’s actual coverage.
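One way to read the required and aspirational counts together is as the share of manifest entries brass currently catches. The sketch below computes that share from the Track A table. It is an illustrative summary, not a metric this page publishes, and it only covers entries listed in each manifest, not every vulnerability in a corpus.

```python
# (required findings, aspirational gaps) per project, from the Track A table.
TRACK_A = {
    "OWASP PyGoat": (7, 4),
    "OWASP NodeGoat": (8, 4),
    "PyCQA Bandit examples/": (23, 0),
    "Yelp detect-secrets test_data/": (6, 3),
    "Snyk Goof": (4, 3),
}


def caught_share(required, aspirational):
    """Fraction of manifest entries that are required (i.e. caught today)."""
    total = required + aspirational
    return required / total if total else 0.0


if __name__ == "__main__":
    for project, (required, aspirational) in TRACK_A.items():
        share = caught_share(required, aspirational)
        print(f"{project}: {required}/{required + aspirational} ({share:.0%})")
```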
False-positive rate: NOT measured here. Track B’s noise floor
(e.g., pallets/flask produces 370 findings despite being a
professionally-audited mature framework) gives some signal — most of
those are AI-anti-pattern or info-level signals, not real bugs.
Comparative FP rate against Snyk/Semgrep/SonarQube is future work.
Completeness vs competitors: NOT claimed. We don’t run brass alongside Snyk/Semgrep/SonarQube on the same corpora; published numbers reflect brass’s behavior in isolation.
Refresh cadence: quarterly. SHAs get re-pinned to recent stable releases; brass commit is bumped to the latest stable. Maintainers diff against the previous publication to show “what changed.”
What brass doesn’t try to do
- Run / instrument target code: static analysis only. brass reads source files as bytes, never executes them.
- Scan vulnerable npm dependencies: that’s npm audit / Snyk territory. brass focuses on source-code patterns.
- Auto-fix or open PRs: detection is the product. The customer decides what to fix.
- Replace a security review: brass surfaces signals; humans triage. We’re transparent about gaps (see Track A aspirational lists).
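To make the "static analysis only" point concrete, here is a minimal sketch of inspecting source bytes without running them, using Python's standard ast module. It is a generic illustration, not brass's implementation; the function name and the choice of eval as the pattern are hypothetical.

```python
import ast
from typing import List


def find_eval_calls(source: bytes) -> List[int]:
    """Return line numbers of direct eval(...) calls in Python source.

    The source is parsed into a syntax tree and never executed.
    """
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
    )


# The sample would exit the interpreter if it were run; parsing it does not.
SAMPLE = b"x = 1\ny = eval('x + 1')\nraise SystemExit('never runs')\n"
EVAL_LINES = find_eval_calls(SAMPLE)
```

Parsing only proves the file is syntactically valid Python; it says nothing about runtime behavior, which is why humans still triage the findings.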