How BrassCoders Flags PII in AI-Generated Code

AI assistants drop real-looking names, emails, and SSNs into fixtures and stubs. BrassCoders flags PII-shaped strings in source before they reach a shared repo.

Copper Sun Brass Team · 4 min read
privacy · oss-core

AI assistants generate test fixtures, seed data, and config stubs full of realistic-looking people: a name, an email, a phone number, sometimes a Social Security number. Some of it is synthetic. Some of it is real, reproduced from training data. Either way it lands in a file, gets committed, and persists in git history long after anyone remembers it was there. BrassCoders flags the PII-shaped strings in your source before they reach a shared repo.

PII in Source Is Its Own Exposure

BrassCoders treats personally identifiable information in source code as a separate exposure from PII in a database, because it leaks through a different door. Database PII sits behind access controls and audit logs. PII hardcoded into a fixture sits in version control, ships in every clone of the repo, and survives in history even after someone deletes the line.

That’s why it slips past normal data-governance review entirely. A privacy team auditing the production database never looks at tests/fixtures/users.py. The OWASP Top 10 (2017) files this under Sensitive Data Exposure: sensitive data ending up somewhere it shouldn’t, with no malicious actor required. A hardcoded real email in a test file is exactly that.

Why AI-Generated Code Carries PII

BrassCoders treats AI-generated fixtures as a likely PII source for two reasons. The first is mundane: assistants write realistic example data because the tutorials they trained on did, so generated fixtures read like real records rather than obvious placeholders.

The second is the one to take seriously. Carlini et al. (ICLR 2023) showed that language models memorize and reproduce verbatim sequences from training data, including names, email addresses, and phone numbers scraped from public repositories. A realistic-looking email in generated code might be invented, or it might be a real person’s address the model reproduced. You cannot safely assume a generated fixture is synthetic.

What BrassCoders Flags

BrassCoders’s privacy scanner flags strings that match PII shapes: phone-number patterns in string literals, Social Security number patterns, and email addresses in hardcoded fixtures. Each match is reported with its file and line, and the matched value is redacted at the scanner, so the finding records the location and the type without writing the data itself into the output.

The redaction is the same two-point contract the secret scanner uses. First, the matched value never reaches the output file. Second, the OSS core makes no outbound network calls. A PII finding is therefore safe to hand to Claude Code or Cursor for triage without forwarding the data anywhere. You see “email pattern at fixtures/users.py:22”, not the address.
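To make the shape of that contract concrete, here is a minimal sketch of a pattern scanner that records location and type but never the matched value. It is an illustration, not BrassCoders’s implementation: the regexes, the Finding class, and the scan_text and render functions are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumed for this sketch, not BrassCoders's rules).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
}


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record: where and what type, never the value."""
    path: str
    line: int
    kind: str


def scan_text(path: str, text: str) -> list[Finding]:
    """Return one finding per PII-shaped match, dropping the matched text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PII_PATTERNS.items():
            for _match in pattern.finditer(line):
                # Redaction at the scanner: the match object is discarded here,
                # so the value cannot leak into any later output.
                findings.append(Finding(path, lineno, kind))
    return findings


def render(finding: Finding) -> str:
    """Format a finding the way the post describes it."""
    return f"{finding.kind} pattern at {finding.path}:{finding.line}"


if __name__ == "__main__":
    sample = (
        'users = [\n'
        '    {"name": "Ada", "email": "ada@example.com"},\n'
        '    {"phone": "415-555-0123", "ssn": "123-45-6789"},\n'
        ']\n'
    )
    for f in scan_text("fixtures/users.py", sample):
        print(render(f))
```

Because Finding has no field for the matched text, redaction is structural rather than a formatting step someone could forget.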

Pattern, Not Verdict

BrassCoders reports that a string looks like PII; it does not rule on whether the string is a real person’s data. A phone-shaped literal could be a customer’s number or a 555 placeholder from a test, and the scanner can’t see the difference — that’s context. So it surfaces every candidate and leaves the real-versus-synthetic call to you and your AI assistant.

This is the deterministic-reporter split again. The scanner’s value is recall: it deterministically reports every string that matches a PII pattern, with no guess that would quietly drop a real one. Your assistant helps sort the obvious fixtures from the genuine identifiers, and the regulated-data judgment (what has to be scrubbed, what’s a fine synthetic value) stays with the human who knows the data. Teams under HIPAA or SOC 2 obligations can wire that into a gate, as described in HIPAA and SOC 2 code scanning.
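The triage side can be partly mechanized for the obvious cases. The sketch below is a hypothetical helper you might run yourself during triage; it is not part of BrassCoders, and the function names are invented for this example. It relies on three public reservations: the example.com, example.org, and example.net domains and the .test, .example, .invalid, and .localhost TLDs are reserved for documentation and testing; North American numbers 555-0100 through 555-0199 are reserved for fictional use; and SSNs with area 000, 666, or 900-999, group 00, or serial 0000 are never issued.

```python
import re

# Domains and TLDs reserved for documentation and testing.
RESERVED_EMAIL_DOMAINS = {"example.com", "example.org", "example.net"}
RESERVED_TLDS = (".test", ".example", ".invalid", ".localhost")


def email_is_reserved(address: str) -> bool:
    """True if the address uses a reserved documentation/test domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    return (
        domain in RESERVED_EMAIL_DOMAINS
        or any(domain.endswith("." + d) for d in RESERVED_EMAIL_DOMAINS)
        or domain.endswith(RESERVED_TLDS)
    )


def phone_is_fictional(number: str) -> bool:
    """True if a North American number falls in 555-0100 through 555-0199."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return len(digits) == 10 and digits[3:6] == "555" and digits[6:8] == "01"


def ssn_is_never_issued(ssn: str) -> bool:
    """True if the SSN falls in a range that is never assigned."""
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    return (
        area in ("000", "666")
        or area[0] == "9"
        or group == "00"
        or serial == "0000"
    )
```

A True result only says the value is a known placeholder. A False result says nothing either way, so those findings still need the human call described above.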

Run It

The PII scan runs in every scan, no flag required:

pipx install brasscoders
brasscoders --offline scan

PII findings land in .brass/ next to the security and secrets findings, redacted, ready to triage. For the full set of detectors in the same pass, see what BrassCoders detects; for the broader map of what AI assistants miss, the AI Blind Spots pillar covers PII flows among the seven categories.

Frequently Asked Questions

What counts as PII in source code?

Personally identifiable information hardcoded into files: names, email addresses, phone numbers, Social Security numbers, and similar identifiers in test fixtures, config stubs, or example data. PII in source is a distinct exposure from PII in a database — it lands in version control, ships in the repo, and persists in history. BrassCoders flags the PII-shaped strings it finds at scan time.

How is PII in code different from a secret?

A secret is a credential (an API key, a token); PII is a person's data. They leak the same way, hardcoded into a file and committed, but the obligations differ — privacy regulations versus credential rotation. BrassCoders runs separate detectors for each, so a leaked email and a leaked AWS key are reported as distinct finding types.

Does BrassCoders know whether a flagged string is real PII?

No, and it doesn't pretend to. It flags strings that match PII shapes — a phone-number pattern, an SSN pattern, an email in a fixture. Whether a given match is a real person's data or a synthetic test value is context the scanner can't see; that triage is yours and your AI assistant's. The scanner's job is to surface every candidate.

Why do AI assistants generate PII?

Two reasons. They produce realistic-looking fixtures and example data because that's what their training data modeled. And they can reproduce real PII verbatim: Carlini et al. (ICLR 2023) showed language models reproduce memorized sequences, including names, emails, and phone numbers, from training data. You cannot safely assume a realistic email in generated code is invented.

How do I run the PII scan?

It runs in every brasscoders scan, with no flag. Install with pipx install brasscoders and run brasscoders --offline scan; PII findings land in .brass/ alongside the security and secrets findings, and like every finding the matched value is redacted at the scanner.