brass benchmark results

Published from docs/benchmarks/public/render_public.py. Last refreshed 2026-05-18. Each linked project has a per-project page with reproducible scan instructions.

brass is a Python CLI that statically analyzes codebases for security vulnerabilities, AI-introduced anti-patterns, and code-quality issues. These benchmarks show what brass produces on a curated set of known-vulnerable and well-maintained third-party projects, each pinned at a specific commit so any reader can reproduce the results.

Track A — documented-vulnerability detection

Intentionally-vulnerable training corpora maintained by security projects. Each entry has a manifest of specific lines brass catches (required) and gaps brass doesn’t yet catch (aspirational).

Project	Required findings	Aspirational gaps	Categories
OWASP PyGoat	7	4	command_injection, hardcoded_credential, weak_crypto
OWASP NodeGoat	8	4	code_injection, hardcoded_credential
PyCQA Bandit examples/	23	0	assert_in_production, code_injection, command_injection, deserialization, hardcoded_credential, insecure_authentication, insecure_binding, insecure_permissions, insecure_tempfile, insecure_transport, path_traversal, security_misconfiguration, sql_injection, weak_crypto, xss
Yelp detect-secrets test_data/	6	3	hardcoded_credential
Snyk Goof	4	3	hardcoded_credential
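
One way to read the table above is as a per-corpus coverage ratio: required findings divided by all manifest entries (required plus aspirational). The sketch below is illustrative only. The `coverage` helper is a hypothetical name, the ratio is not a metric brass publishes, and it is relative to each manifest rather than to every vulnerability in the corpus. The counts are copied from the table.

```python
# Counts copied from the Track A table: (required findings, aspirational gaps).
TRACK_A = {
    "OWASP PyGoat": (7, 4),
    "OWASP NodeGoat": (8, 4),
    "PyCQA Bandit examples/": (23, 0),
    "Yelp detect-secrets test_data/": (6, 3),
    "Snyk Goof": (4, 3),
}


def coverage(required: int, aspirational: int) -> float:
    """Share of manifest entries currently caught (hypothetical, derived metric)."""
    total = required + aspirational
    if total == 0:
        raise ValueError("manifest has no entries")
    return required / total


if __name__ == "__main__":
    for project, (required, aspirational) in TRACK_A.items():
        print(f"{project}: {coverage(required, aspirational):.0%}")
```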

Track B — output on real-world code

Mature, professionally maintained open-source projects. NOT intentionally vulnerable. These show what brass produces on codebases shaped like customers' own, and they serve as the noise-floor baseline for regression detection. Finding counts should stay stable across brass releases; CI applies a ±20% tolerance to findings.

Project	Total findings	Critical	Wall time (s)	Top scanner
pallets/flask	370	50	161.42	`PhantomAICodeScanner`
tiangolo/fastapi	848	50	584.45	`PhantomAICodeScanner`
django/django	1608	50	1333.35	`PhantomAICodeScanner`
vercel/commerce	0	0	3.60	n/a (no findings)
vercel/turborepo	210	50	131.03	`Brass2PrivacyScanner`
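
The ±20% findings tolerance can be sketched as a simple relative-band check. This is a minimal sketch under stated assumptions, not the actual CI code: the source does not say whether the band is measured relative to the previous publication, nor how a zero baseline (such as vercel/commerce) is treated. Both choices below, and the `within_tolerance` name, are assumptions.

```python
def within_tolerance(baseline: int, current: int, tolerance: float = 0.20) -> bool:
    """Return True if `current` lies within +/- `tolerance` of `baseline`.

    Assumption: the band is relative to the baseline count.
    """
    if baseline == 0:
        # A relative band is undefined for a zero baseline, so this sketch
        # requires an exact match (an assumption, not documented behavior).
        return current == 0
    return abs(current - baseline) <= tolerance * baseline
```

For example, against the pallets/flask baseline of 370 findings, a later run with 400 findings would pass this check, while a run with 500 would not.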

Methodology

brass version: each per-project page records the exact brass commit SHA that produced its numbers. Re-running brass at that SHA on the pinned upstream commit reproduces the metrics.

Scanners that ran: PhantomAICodeScanner, Brass2PrivacyScanner, auth_pattern_analyzer, SecretsScanner, AIContextCoherenceScanner, JavaScriptTypeScriptScanner, BrassPerformanceScanner, ContentModerationScanner, input_validation_analyzer, AstGrepScanner, PysaTaintScanner, bandit. Some scanners depend on external binaries (bandit, ast-grep, pyre, semgrep, node); the CI environment has all of them installed. Customer environments may produce different numbers if any of these are missing.
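
Because missing binaries can change the numbers, it may help to check a local environment before comparing results. The sketch below uses the standard-library `shutil.which` to look up the binaries named above. The `missing_binaries` helper is hypothetical, and the executable names on `PATH` are an assumption: a given install may expose a tool under a different name.

```python
import shutil
from typing import Callable, Iterable, Optional

# Binary names as listed in the text. The actual executable names on PATH
# are an assumption and may differ between installs.
EXTERNAL_BINARIES = ("bandit", "ast-grep", "pyre", "semgrep", "node")


def missing_binaries(
    names: Iterable[str] = EXTERNAL_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names that cannot be resolved on PATH, in input order."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All external binaries found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the real `PATH` is consulted.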

Enrichment: published numbers come from `brasscoders --offline scan` (no AI enrichment). The enriched output that paid-plan customers see contains the same underlying findings, plus AI-mediated clustering, re-ranking, and a contextual rationale per finding.

True-positive rate: Track A’s required_findings count IS the true-positive surface: the documented vulnerabilities in each corpus that brass is confirmed to catch. Aspirational entries are KNOWN gaps, published so customers can judge brass’s actual coverage.

False-positive rate: NOT measured here. Track B’s noise floor gives some signal: pallets/flask, a mature and professionally audited framework, still produces 370 findings, and most of those are AI-anti-pattern or info-level signals, not real bugs. Comparative FP rate against Snyk/Semgrep/SonarQube is future work.

Completeness vs competitors: NOT claimed. We don’t run brass alongside Snyk/Semgrep/SonarQube on the same corpora; published numbers reflect brass’s behavior in isolation.

Refresh cadence: quarterly. Upstream SHAs are re-pinned to recent stable releases, and the brass commit is bumped to the latest stable release. Maintainers diff the new numbers against the previous publication to show “what changed.”
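
The “what changed” comparison could look something like the sketch below, which diffs per-project finding totals between two publications. The source does not describe the real diff format, so the `diff_publications` helper and its output shape are hypothetical.

```python
def diff_publications(
    previous: dict[str, int], current: dict[str, int]
) -> dict[str, object]:
    """Compare per-project finding totals between two publications.

    Returns projects added, projects removed, and the signed change in
    findings for projects present in both whose totals differ.
    """
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": {
            name: current[name] - previous[name]
            for name in sorted(shared)
            if current[name] != previous[name]
        },
    }
```

Projects whose totals are identical in both publications are omitted from `changed`, which keeps the report focused on movement.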

What brass doesn’t try to do

Run / instrument target code: static analysis only. brass reads source files as bytes, never executes them.
Scan vulnerable npm dependencies: that’s npm audit / Snyk territory. brass focuses on source-code patterns.
Auto-fix or open PRs: detection is the product. The customer decides what to fix.
Replace a security review: brass surfaces signals; humans triage. We’re transparent about gaps (see Track A aspirational lists).
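
To illustrate the first point in the list above, reading source as bytes without executing it, here is a minimal sketch of a byte-level pattern scan. It is not brass's implementation: the `CREDENTIAL_RE` pattern and the `scan_bytes` and `scan_file` helpers are hypothetical and far cruder than a real scanner.

```python
import re
from pathlib import Path

# Hypothetical pattern for illustration only; not one of brass's rules.
CREDENTIAL_RE = re.compile(
    rb"(?i)\b(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]"
)


def scan_bytes(data: bytes) -> list[int]:
    """Return 1-based line numbers whose raw bytes match the pattern."""
    return [
        lineno
        for lineno, line in enumerate(data.splitlines(), start=1)
        if CREDENTIAL_RE.search(line)
    ]


def scan_file(path: Path) -> list[int]:
    """Scan a file's bytes; the file is only read, never imported or run."""
    return scan_bytes(path.read_bytes())
```

Working on bytes means the scan does not depend on the file being valid, decodable, or importable source, which is one reason a static tool might prefer it.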