Blog — Dockier

Most PII-detection demos look like this: paste a schema, watch GPT-4o emit a confident list of columns. It's a great demo and a bad product. Confidence is not accuracy, the model bills per token, and "the LLM said so" is the wrong defence to bring to a SOC 2 auditor.

We shipped a 320-line schema parser that beats GPT-4o on the benchmark we care about. We also still ship the LLM. Here's where each one wins.

In the Dockier stack · The sensitive-data scanner ships inside the code-analysis Fastify service (TypeScript, Zod-typed routes, OpenAPI docs at /docs). It runs on every commit hook and on-demand from the React dashboard. Labelled findings are written to Supabase Postgres and joined to the project's RLS-scoped row.

What we actually parse

The parser reads schemas — not data — for five sources: raw SQL DDL, Prisma, Eloquent (Laravel), TypeScript-with-decorators (TypeORM / MikroORM), and Python (SQLAlchemy / Pydantic). For each table or model we extract:

Column name and type.
Comments and decorator metadata (e.g. @Index, @Encrypted).
Relations, because a PII label on users.email propagates to any related column (signups.email, say), even when the related column's name isn't an obvious giveaway.
Any nearby validation rules — Zod schemas, FastAPI models, Laravel form requests.
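Relation propagation is the least obvious item in that list, so here is a minimal sketch of how a label might travel across a relation. The `Relation` shape, the `propagateLabels` helper, and the severity ordering are all assumptions made for illustration; the post doesn't show the parser's real internals.

```typescript
// Hypothetical shapes: not the parser's actual types.
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Relation {
  from: string; // e.g. "signups.email"
  to: string;   // e.g. "users.email"
}

// Assumed ordering, used so that propagation never downgrades a label.
const SEVERITY: Record<Label, number> = { None: 0, PII: 1, Sensitive: 2, Secret: 3 };

// Spread labels across related columns until nothing changes.
// Relations are treated as undirected: either side can label the other.
function propagateLabels(
  labels: Record<string, Label>,
  relations: Relation[],
): Record<string, Label> {
  const result: Record<string, Label> = { ...labels };
  let changed = true;
  while (changed) {
    changed = false;
    for (const { from, to } of relations) {
      const a: Label = result[from] ?? "None";
      const b: Label = result[to] ?? "None";
      const stronger = SEVERITY[a] >= SEVERITY[b] ? a : b;
      if (a !== stronger) { result[from] = stronger; changed = true; }
      if (b !== stronger) { result[to] = stronger; changed = true; }
    }
  }
  return result;
}

const labelled = propagateLabels(
  { "users.email": "PII", "signups.email": "None" },
  [{ from: "signups.email", to: "users.email" }],
);
// labelled["signups.email"] is now "PII"
```

The loop terminates because labels only ever move up the severity ordering, and there are finitely many columns and levels.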

Classification is then a deterministic decision tree: column name regex, type check, comment override, relation propagation. There is no model.
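As a rough sketch of what those four steps could look like in order: the patterns below use only the column names mentioned in this post plus a few assumed Secret patterns, and the `pii:<label>` comment annotation is a made-up convention standing in for whatever override syntax the real parser accepts.

```typescript
// Hypothetical column shape and rules; a real vocabulary would be larger.
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Column {
  name: string;
  type: string;         // e.g. "varchar", "boolean", "jsonb"
  comment?: string;     // SQL comment or decorator metadata, if any
  relatedLabel?: Label; // label already assigned to a related column
}

interface Verdict { label: Label; reason: string }

const NAME_RULES: Array<[RegExp, Label]> = [
  [/(^|_)(password|secret|api_key|token)($|_)/i, "Secret"], // assumed patterns
  [/(^|_)(email|phone|ssn|address|dob|passport)($|_)/i, "PII"],
];

const FLAG_TYPES = /^bool(ean)?$/i;
const OVERRIDE = /\bpii:(none|pii|sensitive|secret)\b/i; // assumed annotation syntax
const BY_NAME: Record<string, Label> = {
  none: "None", pii: "PII", sensitive: "Sensitive", secret: "Secret",
};

function classify(col: Column): Verdict {
  // 1. Column-name regex.
  let verdict: Verdict = { label: "None", reason: "no rule matched" };
  for (const [pattern, label] of NAME_RULES) {
    if (pattern.test(col.name)) {
      verdict = { label, reason: `name matches ${pattern.source}` };
      break;
    }
  }
  // 2. Type check: a boolean such as email_verified cannot hold an address.
  if (verdict.label !== "None" && FLAG_TYPES.test(col.type)) {
    verdict = { label: "None", reason: `type ${col.type} cannot hold the value` };
  }
  // 3. Comment override wins over steps 1 and 2.
  const override = col.comment?.match(OVERRIDE);
  if (override) {
    verdict = { label: BY_NAME[override[1].toLowerCase()], reason: "comment override" };
  }
  // 4. Relation propagation: inherit only if still unlabelled and not overridden.
  if (verdict.label === "None" && !override && col.relatedLabel && col.relatedLabel !== "None") {
    verdict = { label: col.relatedLabel, reason: "propagated via relation" };
  }
  return verdict;
}

classify({ name: "email", type: "varchar" }).label;                       // "PII"
classify({ name: "email_verified", type: "boolean" }).label;              // "None"
classify({ name: "contact", type: "text", relatedLabel: "PII" }).label;   // "PII"
```

Returning a `reason` alongside the label is a design choice worth copying regardless of the details: it gives you something to show a reviewer other than "the scanner said so".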

The benchmark

We built a benchmark from 12 open-source codebases — Mastodon, PostHog, Discourse, Cal.com, and so on — totalling 3,400 columns. We hand-labelled them across four classes: None, PII, Sensitive, Secret.

The parser alone beats the LLM on both precision and recall. The hybrid is what we actually ship: it calls the LLM only when the parser's confidence is below threshold, which keeps the cost at roughly one-eighteenth of the LLM-only run.

Parser only: 96.2% precision, 91.4% recall.
GPT-4o only: 88.5% precision, 84.7% recall, $0.07 per repo.
Parser → LLM tie-breaker: 97.8% precision, 93.1% recall, $0.004 per repo.
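For anyone re-scoring the dataset, here is one plausible way to compute those two numbers over four classes. It is an assumption, not necessarily the exact scoring used for the figures above: None is treated as the negative class, and a prediction counts as a true positive only when the class matches exactly. The sample rows are made up and are not benchmark data.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Scored { predicted: Label; actual: Label }

// Micro-averaged precision and recall over the three non-None classes.
function precisionRecall(rows: Scored[]): { precision: number; recall: number } {
  let truePositives = 0;
  let predictedPositives = 0;
  let actualPositives = 0;
  for (const { predicted, actual } of rows) {
    if (predicted !== "None") predictedPositives++;
    if (actual !== "None") actualPositives++;
    if (predicted !== "None" && predicted === actual) truePositives++;
  }
  return {
    precision: predictedPositives === 0 ? 0 : truePositives / predictedPositives,
    recall: actualPositives === 0 ? 0 : truePositives / actualPositives,
  };
}

// Toy rows for illustration only.
const toyScores = precisionRecall([
  { predicted: "PII", actual: "PII" },       // correct
  { predicted: "PII", actual: "None" },      // false alarm
  { predicted: "None", actual: "Secret" },   // miss
  { predicted: "Secret", actual: "Secret" }, // correct
  { predicted: "Sensitive", actual: "PII" }, // flagged, but wrong class
]);
// toyScores is { precision: 0.5, recall: 0.5 }
```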

The parser wins because PII detection is overwhelmingly a pattern-matching problem with a small, known vocabulary: email, phone, ssn, address, dob, passport. The long tail of edge cases is shorter than people think. The model wins on ambiguous columns whose names carry no signal (note_field, meta, payload), which is exactly where the parser's confidence drops below the threshold and the tie-breaker kicks in.
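The routing itself can be very small. The sketch below assumes the parser attaches a confidence score between 0 and 1 to each column; the `ParsedColumn` shape, the example scores, and the 0.8 threshold are illustrative, since the post doesn't state the real values.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

// Hypothetical parser output for one column.
interface ParsedColumn {
  name: string;
  label: Label;
  confidence: number; // 0..1, assumed scale
}

// Keep confident parser verdicts; queue the rest for the LLM tie-breaker.
function splitByConfidence(columns: ParsedColumn[], threshold: number) {
  const accepted: ParsedColumn[] = [];
  const forModel: ParsedColumn[] = [];
  for (const col of columns) {
    (col.confidence >= threshold ? accepted : forModel).push(col);
  }
  return { accepted, forModel };
}

const routed = splitByConfidence(
  [
    { name: "email", label: "PII", confidence: 0.99 },
    { name: "payload", label: "None", confidence: 0.3 },
    { name: "meta", label: "None", confidence: 0.2 },
  ],
  0.8, // assumed threshold
);
// routed.forModel holds only "payload" and "meta"
```

Splitting first and calling the model second also means the expensive path can be batched, and its size is easy to log per scan.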

Why the parser was 12× faster

A scan of 3,400 columns runs in ~180 ms on a laptop. The same workload through an LLM takes ~2,200 ms even with batched prompts, roughly 12× longer, because you're bottlenecked on time to first token (TTFT).

For a scanner that runs on every commit, this is the difference between developers leaving the integration on and turning it off.

The edge cases that bit us

Polymorphic columns. Eloquent's morphTo tables have a {name}_id + {name}_type pair. The id isn't PII on its own; with the type column it might be. We had to teach the parser about morph relations.
JSON blobs. A column called metadata jsonb is a hole the parser can't see into. We added a config knob to mark any JSON column as Sensitive by default with an explicit opt-out.
Pydantic optional unions. Union[str, None] with a Field description that mentioned "email" used to fall through the cracks. Fixed.
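The first two cases are easy to picture in code. This is a simplified sketch, not the parser's implementation: `findMorphPairs` only looks at column names, and the `ScanConfig` field names are invented stand-ins for the real config knob.

```typescript
interface Column { name: string; type: string }

// Find Eloquent-style morph pairs: a `{name}_id` column with a sibling `{name}_type`.
function findMorphPairs(columns: Column[]): string[] {
  const names = columns.map((c) => c.name);
  return names
    .filter((n) => /_id$/.test(n))
    .map((n) => n.replace(/_id$/, ""))
    .filter((base) => names.indexOf(`${base}_type`) !== -1);
}

// Hypothetical config: field names are assumptions, not the shipped knob.
interface ScanConfig {
  jsonDefaultsToSensitive: boolean;
  jsonOptOut: string[]; // fully qualified columns, e.g. "comments.metadata"
}

// Returns "Sensitive" for JSON columns unless explicitly opted out;
// undefined means "no opinion, fall through to the other rules".
function jsonDefaultLabel(
  table: string,
  col: Column,
  config: ScanConfig,
): "Sensitive" | undefined {
  const isJson = /^jsonb?$/i.test(col.type);
  const optedOut = config.jsonOptOut.indexOf(`${table}.${col.name}`) !== -1;
  return isJson && config.jsonDefaultsToSensitive && !optedOut ? "Sensitive" : undefined;
}

const comments: Column[] = [
  { name: "id", type: "bigint" },
  { name: "commentable_id", type: "bigint" },
  { name: "commentable_type", type: "varchar" },
  { name: "metadata", type: "jsonb" },
];

const morphs = findMorphPairs(comments); // ["commentable"]
const blob = jsonDefaultLabel("comments", comments[3], {
  jsonDefaultsToSensitive: true,
  jsonOptOut: [],
}); // "Sensitive"
```

Once a pair is found, the id and type columns can be classified together rather than one at a time, which is the whole point of detecting the pair.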

When you should still use the LLM

We still call it in three situations: when the column name is opaque (data, payload), when a comment needs interpreting ("stores user-supplied feedback"), and when a label was inferred through a relation and we want a second opinion. Each call is a single ~120-token prompt; the bill is closer to free than to expensive.
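Those three triggers translate into a small predicate plus a prompt builder. Everything named here is an assumption for the sake of the example: the `labelSource` field, the opaque-name list (taken from the names mentioned in this post), the simplification that any free-text comment needs interpreting, and the prompt wording, which is not the production prompt.

```typescript
// Hypothetical column shape for the LLM step.
interface Column {
  table: string;
  name: string;
  type: string;
  comment?: string;
  labelSource?: "name" | "comment" | "relation"; // how the parser got its label
}

type Trigger = "opaque-name" | "comment" | "relation";

const OPAQUE_NAMES = /^(data|payload|meta|note_field)$/i;

// Which of the three situations apply to this column, if any.
function llmTriggers(col: Column): Trigger[] {
  const triggers: Trigger[] = [];
  if (OPAQUE_NAMES.test(col.name)) triggers.push("opaque-name");
  if (col.comment && col.comment.trim() !== "") triggers.push("comment");
  if (col.labelSource === "relation") triggers.push("relation");
  return triggers;
}

// One short prompt per column; the wording is illustrative only.
function buildPrompt(col: Column): string {
  return [
    "Classify this database column as one of: None, PII, Sensitive, Secret.",
    `Column: ${col.table}.${col.name} (${col.type})`,
    col.comment ? `Comment: ${col.comment}` : "Comment: (none)",
    "Answer with the class name only.",
  ].join("\n");
}

const payloadColumn: Column = { table: "events", name: "payload", type: "jsonb" };
const triggers = llmTriggers(payloadColumn); // ["opaque-name"]
const prompt = triggers.length > 0 ? buildPrompt(payloadColumn) : undefined;
```

Asking for the class name only keeps the response to a handful of tokens, which is what makes the per-call cost negligible.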

If you want to reproduce the benchmark, the labelled dataset lives in the public dockier/pii-benchmark repo. PRs welcome — especially for Ruby/Active Record models, which we under-tested.

Detecting sensitive data without an LLM

