Most PII-detection demos look like this: paste a schema, watch GPT-4o emit a confident list of columns. It's a great demo and a bad product. Confidence is not accuracy, the model bills per token, and "the LLM said so" is the wrong defence to bring to a SOC 2 auditor.
We shipped a 320-line schema parser that beats GPT-4o on the benchmark we care about. We also still ship the LLM. Here's where each one wins.
In the Dockier stack · The sensitive-data scanner ships inside the code-analysis Fastify service (TypeScript, Zod-typed routes, OpenAPI docs at /docs). It runs on every commit hook and on demand from the React dashboard. Labelled findings are written to Supabase Postgres and joined to the project's RLS-scoped row.
What we actually parse
The parser reads schemas — not data — for five sources: raw SQL DDL, Prisma, Eloquent (Laravel), TypeScript-with-decorators (TypeORM / MikroORM), and Python (SQLAlchemy / Pydantic). For each table or model we extract:
- Column name and type.
- Comments and decorator metadata (e.g. @Index, @Encrypted).
- Relations, because users.email being PII propagates a label to signups.email if they're related, even if the second isn't named obviously.
- Any nearby validation rules: Zod schemas, FastAPI models, Laravel form requests.
Classification is then a deterministic decision tree: column name regex, type check, comment override, relation propagation. There is no model.
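To make the four steps concrete, here is a minimal sketch of such a decision tree. It is not the shipped parser: the regex vocabulary, the list of "harmless" types, and the `pii:<label>` comment annotation are all assumptions invented for this example.

```typescript
// Illustrative sketch only. Patterns, type list, and the "pii:<label>"
// comment syntax are hypothetical, not the production rules.
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Column {
  table: string;
  name: string;
  type: string;
  comment?: string;
  relatedTo?: string; // "table.column" this column is related to, if any
}

// Step 1 vocabulary: first matching rule wins.
const NAME_RULES: Array<[RegExp, Label]> = [
  [/(^|_)(password|secret|token|api_key)(_|$)/i, "Secret"],
  [/(^|_)(email|phone|ssn|address|dob|passport)(_|$)/i, "PII"],
];

// Step 2: types that cannot hold the value the name hints at.
const HARMLESS_TYPES = /^(bool|boolean|timestamp|timestamptz)$/i;

// Step 3: explicit annotation inside a column comment.
const OVERRIDE = /\bpii:(none|pii|sensitive|secret)\b/i;
const OVERRIDE_LABELS: Record<string, Label> = {
  none: "None",
  pii: "PII",
  sensitive: "Sensitive",
  secret: "Secret",
};

function classifyColumn(col: Column, known: Record<string, Label>): Label {
  // 1. Column name regex.
  let label: Label = "None";
  for (const [pattern, candidate] of NAME_RULES) {
    if (pattern.test(col.name)) {
      label = candidate;
      break;
    }
  }

  // 2. Type check: email_verified boolean is not an email address.
  if (label !== "None" && HARMLESS_TYPES.test(col.type)) {
    label = "None";
  }

  // 3. Comment override beats everything computed so far.
  const match = col.comment ? OVERRIDE.exec(col.comment) : null;
  if (match) {
    const key = (match[1] ?? "").toLowerCase();
    return OVERRIDE_LABELS[key] ?? label;
  }

  // 4. Relation propagation: inherit the label of the related column.
  if (label === "None" && col.relatedTo) {
    label = known[col.relatedTo] ?? "None";
  }
  return label;
}

// Usage: labels already assigned to other columns feed step 4.
const known: Record<string, Label> = { "users.email": "PII" };
const inherited = classifyColumn(
  { table: "signups", name: "contact", type: "varchar", relatedTo: "users.email" },
  known,
);
const flag = classifyColumn(
  { table: "users", name: "email_verified", type: "boolean" },
  known,
);
console.log(inherited, flag); // PII None
```

Because every step is a plain lookup or regex test, the same input always yields the same label, which is what makes the output easy to explain to an auditor.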
The benchmark
We built a benchmark from 12 open-source codebases — Mastodon, PostHog, Discourse, Cal.com, and so on — totalling 3,400 columns. We hand-labelled them across four classes: None, PII, Sensitive, Secret.
- Parser only: 96.2% precision, 91.4% recall.
- GPT-4o only: 88.5% precision, 84.7% recall, $0.07 per repo.
- Parser → LLM tie-breaker: 97.8% precision, 93.1% recall, $0.004 per repo.
The parser wins because PII detection is overwhelmingly a pattern-matching problem with a known small vocabulary. email, phone, ssn, address, dob, passport — the long tail of edge cases is shorter than people think. The model wins on ambiguous columns where the name carries no signal — note_field, meta, payload — and we use it to break ties when the parser's confidence is below a threshold.
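The tie-breaker step can be pictured as a simple split: findings the parser is sure about are accepted as-is, and only the rest are queued for the model. The sketch below assumes a numeric confidence per finding and a threshold of 0.8; both the field names and the threshold value are placeholders, since the actual cut-off isn't stated here.

```typescript
// Illustrative sketch: the Finding shape and the 0.8 threshold are assumptions.
interface Finding {
  column: string;     // e.g. "users.email"
  label: string;      // parser's best guess
  confidence: number; // 0..1, hypothetical scale
}

interface Split {
  accepted: Finding[]; // parser verdict stands
  escalate: Finding[]; // sent to the LLM as a tie-breaker
}

function splitByConfidence(findings: Finding[], threshold: number): Split {
  const accepted: Finding[] = [];
  const escalate: Finding[] = [];
  for (const finding of findings) {
    if (finding.confidence >= threshold) {
      accepted.push(finding);
    } else {
      escalate.push(finding);
    }
  }
  return { accepted, escalate };
}

// Usage with made-up findings.
const result = splitByConfidence(
  [
    { column: "users.email", label: "PII", confidence: 0.95 },
    { column: "events.payload", label: "None", confidence: 0.4 },
  ],
  0.8,
);
console.log(result.accepted.length, result.escalate.length); // 1 1
```

Only the `escalate` bucket costs tokens, which is one way to read the gap between the per-repo cost of the LLM-only run and the hybrid run.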
Why the parser was 12× faster
A scan of 3,400 columns runs in ~180 ms on a laptop. The same workload through an LLM takes ~2,200 ms even with batched prompts, because you're bottlenecked on time to first token (TTFT).
The edge cases that bit us
- Polymorphic columns. Eloquent's morphTo tables have a {name}_id + {name}_type pair. The id isn't PII on its own; with the type column it might be. We had to teach the parser about morph relations.
- JSON blobs. A column called metadata jsonb is a hole the parser can't see into. We added a config knob to mark any JSON column as Sensitive by default with an explicit opt-out.
- Pydantic optional unions. Union[str, None] with a Field description that mentioned "email" used to fall through the cracks. Fixed.
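The first two cases lend themselves to small, testable helpers. The sketch below shows one way they might look; the config field names (`jsonDefaultSensitive`, `jsonOptOut`) and both function names are hypothetical and chosen for this example only.

```typescript
// Illustrative sketch: config shape and helper names are assumptions.
interface Col {
  table: string;
  name: string;
  type: string;
}

interface ScanConfig {
  jsonDefaultSensitive: boolean; // the "config knob"
  jsonOptOut: string[];          // explicit opt-outs as "table.column"
}

// Returns a label for JSON columns, or null when the column is not JSON
// and the normal decision tree should handle it.
function jsonColumnLabel(col: Col, config: ScanConfig): "Sensitive" | "None" | null {
  if (!/^jsonb?$/i.test(col.type)) {
    return null;
  }
  const key = col.table + "." + col.name;
  const optedOut = config.jsonOptOut.indexOf(key) !== -1;
  return config.jsonDefaultSensitive && !optedOut ? "Sensitive" : "None";
}

// Finds morph bases: names that appear as both {name}_type and {name}_id.
function findMorphPairs(columnNames: string[]): string[] {
  const bases: string[] = [];
  for (const name of columnNames) {
    const match = /^(.+)_type$/.exec(name);
    const base = match ? match[1] : undefined;
    if (base && columnNames.indexOf(base + "_id") !== -1) {
      bases.push(base);
    }
  }
  return bases;
}

// Usage with a made-up comments table.
const config: ScanConfig = { jsonDefaultSensitive: true, jsonOptOut: ["events.metadata"] };
console.log(jsonColumnLabel({ table: "users", name: "metadata", type: "jsonb" }, config)); // Sensitive
console.log(jsonColumnLabel({ table: "events", name: "metadata", type: "jsonb" }, config)); // None
console.log(findMorphPairs(["id", "commentable_id", "commentable_type", "body"])); // [ 'commentable' ]
```

Once a morph pair is detected, the two columns can be classified together rather than one at a time, which is the part a per-column name regex cannot do on its own.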
When you should still use the LLM
We still call it in three situations: when the column name is opaque (data, payload), when there's a comment that needs interpreting ("stores user-supplied feedback"), and when a label was inferred through a relation and we want a second opinion on it. Each call is a single ~120-token prompt; the bill is closer to free than to expensive.
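Those three triggers can be expressed as a small predicate that also records why a column was escalated. This is a sketch under assumptions: the list of opaque names, the `labelSource` field, and the reason strings are invented for illustration, and "any non-empty comment" is a deliberately crude stand-in for "a comment that needs interpreting".

```typescript
// Illustrative sketch: field names, reason strings, and the opaque-name
// list are assumptions, not the production rules.
interface ScannedColumn {
  name: string;
  comment?: string;
  labelSource: "name" | "relation" | "none"; // how the parser got its label
}

type Reason = "opaque-name" | "has-comment" | "relation-inferred";

const OPAQUE_NAMES = /^(data|payload|meta|metadata|note_field)$/i;

function llmReasons(col: ScannedColumn): Reason[] {
  const reasons: Reason[] = [];
  if (OPAQUE_NAMES.test(col.name)) {
    reasons.push("opaque-name");
  }
  if (col.comment && col.comment.trim().length > 0) {
    reasons.push("has-comment");
  }
  if (col.labelSource === "relation") {
    reasons.push("relation-inferred");
  }
  return reasons;
}

// Usage: an empty array means the parser's verdict stands on its own.
const reasons = llmReasons({
  name: "payload",
  comment: "stores user-supplied feedback",
  labelSource: "none",
});
console.log(reasons); // [ 'opaque-name', 'has-comment' ]
```

Storing the reasons alongside the finding keeps the audit trail intact: a reviewer can see which columns the model touched and what prompted the call.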
If you want to reproduce the benchmark, the labelled dataset lives in the public dockier/pii-benchmark repo. PRs welcome — especially for Ruby/Active Record models, which we under-tested.