
Detecting sensitive data without an LLM

Dockier Security · 13 Apr 2026 · 7 min read

Most PII-detection demos look like this: paste a schema, watch GPT-4o emit a confident list of columns. It's a great demo and a bad product. Confidence is not accuracy, the model bills per token, and "the LLM said so" is the wrong defence to bring to a SOC 2 auditor.

We shipped a 320-line schema parser that beats GPT-4o on the benchmark we care about. We also still ship the LLM. Here's where each one wins.

In the Dockier stack · The sensitive-data scanner ships inside the code-analysis Fastify service (TypeScript, Zod-typed routes, OpenAPI docs at /docs). It runs on every commit hook and on-demand from the React dashboard. Labelled findings are written to Supabase Postgres and joined to the project's RLS-scoped row.

What we actually parse

The parser reads schemas — not data — for five sources: raw SQL DDL, Prisma, Eloquent (Laravel), TypeScript-with-decorators (TypeORM / MikroORM), and Python (SQLAlchemy / Pydantic). For each table or model we extract:

  • Column name and type.
  • Comments and decorator metadata (e.g. @Index, @Encrypted).
  • Relations — because users.email being PII propagates a label to signups.email if they're related, even if the second isn't named obviously.
  • Any nearby validation rules — Zod schemas, FastAPI models, Laravel form requests.
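
To make the extraction step concrete for the raw SQL DDL source, here is a minimal sketch. It is not the production parser: the `ExtractedColumn` shape and the `extractColumns` helper are hypothetical names, and the code assumes simple DDL with one column per line, `--` comments, and statements ending in `);`. It only pulls out name, type, and trailing comment.

```typescript
// Hypothetical shape for one extracted column; field names are illustrative.
interface ExtractedColumn {
  table: string;
  name: string;
  type: string;
  comment?: string;
}

// Lines that declare table-level constraints rather than columns.
const CONSTRAINT_LINE = /^(primary|foreign|constraint|unique|check)\b/i;

// Assumes one column per line and statements terminated by ");".
function extractColumns(ddl: string): ExtractedColumn[] {
  const columns: ExtractedColumn[] = [];
  const tableRe =
    /CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\(([\s\S]*?)\);/gi;
  let match: RegExpExecArray | null;
  while ((match = tableRe.exec(ddl)) !== null) {
    const table = match[1] ?? "";
    const body = match[2] ?? "";
    for (const rawLine of body.split("\n")) {
      // Split a trailing "-- comment" off the column definition.
      const commentAt = rawLine.indexOf("--");
      const code = (commentAt >= 0 ? rawLine.slice(0, commentAt) : rawLine).trim();
      const comment = commentAt >= 0 ? rawLine.slice(commentAt + 2).trim() : undefined;
      if (code === "" || CONSTRAINT_LINE.test(code)) continue;
      // First word is the column name, second (with optional "(...)") is the type.
      const col = /^(\w+)\s+(\w+(?:\s*\([^)]*\))?)/.exec(code);
      if (col === null) continue;
      const name = col[1] ?? "";
      const type = col[2] ?? "";
      columns.push({ table, name, type, ...(comment ? { comment } : {}) });
    }
  }
  return columns;
}
```

A real extractor would use a proper SQL parser, since line-based matching breaks on multi-line column definitions and block comments.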

Classification is then a deterministic decision tree: column name regex, type check, comment override, relation propagation. There is no model.
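
The four steps can be sketched roughly as follows. Everything here is an assumption made for illustration: the rule vocabulary, the boolean type check, the `@label:` comment annotation, and the `relatedTo` field are hypothetical, not the shipped rule set. Propagation is limited to one hop to keep the example small.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Column {
  table: string;
  name: string;
  type: string;
  comment?: string;
  relatedTo?: string; // "table.column" this column references (hypothetical field)
}

// Illustrative vocabulary only; first matching rule wins.
const NAME_RULES: Array<[RegExp, Label]> = [
  [/(password|secret|api_?key|token)/i, "Secret"],
  [/(email|phone|ssn|address|dob|passport)/i, "PII"],
];

const BOOLEAN_TYPE = /^bool(ean)?$/i;

function classifyColumn(col: Column): Label {
  // 1. Column name regex.
  let label: Label = "None";
  for (const [pattern, candidate] of NAME_RULES) {
    if (pattern.test(col.name)) {
      label = candidate;
      break;
    }
  }
  // 2. Type check: a flag such as `email_verified boolean` holds no PII itself.
  if (label !== "None" && BOOLEAN_TYPE.test(col.type)) label = "None";
  // 3. Comment override: an explicit annotation beats the heuristics.
  const override = /@label:(None|PII|Sensitive|Secret)\b/.exec(col.comment ?? "");
  if (override) label = override[1] as Label;
  return label;
}

function classifyAll(columns: Column[]): Record<string, Label> {
  const labels: Record<string, Label> = {};
  for (const col of columns) {
    labels[`${col.table}.${col.name}`] = classifyColumn(col);
  }
  // 4. Relation propagation (one hop): an unlabelled column inherits the
  //    direct label of the column it references.
  const direct = { ...labels };
  for (const col of columns) {
    const key = `${col.table}.${col.name}`;
    const inherited = col.relatedTo ? direct[col.relatedTo] : undefined;
    if (direct[key] === "None" && inherited !== undefined && inherited !== "None") {
      labels[key] = inherited;
    }
  }
  return labels;
}
```

Because every step is a plain function of the schema, the same input always yields the same labels, which is the property that makes the output easy to defend in an audit.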

The benchmark

We built a benchmark from 12 open-source codebases — Mastodon, PostHog, Discourse, Cal.com, and so on — totalling 3,400 columns. We hand-labelled them across four classes: None, PII, Sensitive, Secret.
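
For readers who want to score their own scanner against the labels, here is one way precision and recall could be computed over the four classes. The averaging scheme is an assumption, not a description of the published scoring: this sketch micro-averages over the three non-None classes, so a column labelled with the wrong sensitive class counts as both a false positive and a false negative.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Score {
  precision: number;
  recall: number;
}

// Micro-averaged over non-None classes (assumed scheme, see lead-in).
function score(gold: Label[], predicted: Label[]): Score {
  if (gold.length !== predicted.length) {
    throw new Error("gold and predicted must have the same length");
  }
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (let i = 0; i < gold.length; i++) {
    const g = gold[i];
    const p = predicted[i];
    if (p !== "None" && p === g) {
      tp++; // flagged with the correct class
    } else {
      if (p !== "None") fp++; // flagged, but wrong class or not sensitive
      if (g !== "None") fn++; // sensitive column missed or mislabelled
    }
  }
  return {
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
  };
}
```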

[Figure: Precision and recall on the 3,400-column benchmark, higher is better, axis 80–100%. Parser only: 96.2 / 91.4, $0, 180 ms. GPT-4o only: 88.5 / 84.7, $0.07, 2.2 s. Parser → LLM tie-breaker: 97.8 / 93.1, $0.004, 320 ms.]
The parser alone outperforms the LLM with no model cost at all. The hybrid is what we actually ship: it calls the LLM only when the parser's confidence is below threshold, which keeps it at roughly one-eighteenth of the LLM-only cost.
  • Parser only: 96.2% precision, 91.4% recall, $0 per repo.
  • GPT-4o only: 88.5% precision, 84.7% recall, $0.07 per repo.
  • Parser → LLM tie-breaker: 97.8% precision, 93.1% recall, $0.004 per repo.

The parser wins because PII detection is overwhelmingly a pattern-matching problem with a known small vocabulary. email, phone, ssn, address, dob, passport — the long tail of edge cases is shorter than people think. The model wins on ambiguous columns where the name carries no signal — note_field, meta, payload — and we use it to break ties when the parser's confidence is below a threshold.
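
The tie-breaker routing can be pictured as a split on parser confidence. This is a sketch under assumptions: the `ParserVerdict` shape, the 0 to 1 confidence scale, and the 0.8 default are all hypothetical, since the actual threshold is not stated here. The LLM call itself is left out; the second function just merges whatever labels came back.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface ParserVerdict {
  column: string; // e.g. "events.payload"
  label: Label;
  confidence: number; // assumed to be in the range 0..1
}

const DEFAULT_THRESHOLD = 0.8; // hypothetical value

// Columns at or above the threshold keep the parser's label;
// the rest are queued for the LLM.
function splitByConfidence(
  verdicts: ParserVerdict[],
  threshold: number = DEFAULT_THRESHOLD
): { decided: ParserVerdict[]; escalate: ParserVerdict[] } {
  const decided: ParserVerdict[] = [];
  const escalate: ParserVerdict[] = [];
  for (const v of verdicts) {
    (v.confidence >= threshold ? decided : escalate).push(v);
  }
  return { decided, escalate };
}

// Overwrite escalated labels with the LLM's answer, keyed by column.
// If the LLM returned nothing for a column, the parser's label stands.
function applyTieBreaks(
  escalated: ParserVerdict[],
  llmLabels: Record<string, Label>
): ParserVerdict[] {
  return escalated.map((v) => ({ ...v, label: llmLabels[v.column] ?? v.label }));
}
```

Routing this way means the model only ever sees the small ambiguous remainder, which is where the cost saving in the hybrid row comes from.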

Why the parser is 12× faster

A scan of 3,400 columns runs in ~180 ms on a laptop. The same workload through an LLM takes ~2,200 ms even with batched prompts, because every request is bottlenecked on time to first token (TTFT).

[Figure: Latency on 3,400 columns, lower is better, axis 0–2,400 ms. Parser: 180 ms. GPT-4o: 2,200 ms. 12× faster; CI passes in 8 s instead of 35 s.]
For a scanner that runs on every commit, this is the difference between developers leaving the integration on and turning it off.

The edge cases that bit us

  • Polymorphic columns. Eloquent's morphTo tables have a {name}_id + {name}_type pair. The id isn't PII on its own; with the type column it might be. We had to teach the parser about morph relations.
  • JSON blobs. A column called metadata jsonb is a hole the parser can't see into. We added a config knob to mark any JSON column as Sensitive by default with an explicit opt-out.
  • Pydantic optional unions. Union[str, None] with a Field description that mentioned "email" used to fall through the cracks. Fixed.
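
Two of these edge cases are easy to show in code. The sketch below is illustrative only: the `ScannerConfig` fields are hypothetical names for the config knob, and the morph detection simply looks for a `{name}_id` column with a matching `{name}_type` in the same table.

```typescript
type Label = "None" | "PII" | "Sensitive" | "Secret";

interface Column {
  table: string;
  name: string;
  type: string;
}

// Hypothetical config shape for the JSON default and its opt-out list.
interface ScannerConfig {
  jsonDefaultSensitive: boolean;
  jsonOptOut: string[]; // "table.column" entries exempt from the default
}

// Returns "Sensitive" for JSON columns unless opted out;
// undefined means "no default applies, fall through to the other rules".
function defaultJsonLabel(col: Column, config: ScannerConfig): Label | undefined {
  if (!/^jsonb?$/i.test(col.type)) return undefined;
  if (!config.jsonDefaultSensitive) return undefined;
  const key = `${col.table}.${col.name}`;
  return config.jsonOptOut.indexOf(key) === -1 ? "Sensitive" : undefined;
}

// Finds polymorphic pairs and returns them as "table.name" base names.
function findMorphPairs(columns: Column[]): string[] {
  const known = columns.map((c) => `${c.table}.${c.name}`);
  const pairs: string[] = [];
  for (const c of columns) {
    const m = /^(\w+)_id$/.exec(c.name);
    if (m && known.indexOf(`${c.table}.${m[1]}_type`) !== -1) {
      pairs.push(`${c.table}.${m[1]}`);
    }
  }
  return pairs;
}
```

Once a pair is found, the id and type columns can be classified together instead of separately.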

When you should still use the LLM

We still call it in three situations: when the column name is opaque (data, payload), when there's a comment that needs interpreting ("stores user-supplied feedback"), and when a label was inferred through relation propagation and we want a second opinion on it. Each call is a single ~120-token prompt; the bill is closer to free than to expensive.
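
As a rough picture of those three triggers and the kind of short prompt they lead to, consider the sketch below. The `Finding` shape, the opaque-name list, and the prompt wording are all hypothetical, and treating any non-empty comment as "needs interpreting" is a deliberately crude stand-in for whatever heuristic a real scanner would use.

```typescript
interface Finding {
  column: string; // "table.column"
  type: string;
  comment?: string;
  viaRelation?: string; // set when the label was propagated from this column
}

type Reason = "opaque-name" | "comment" | "relation";

// Illustrative list of names that carry no signal.
const OPAQUE_NAMES = /^(data|payload|meta|note_field)$/i;

// Returns the reasons (if any) to ask the LLM about this finding.
function escalationReasons(f: Finding): Reason[] {
  const reasons: Reason[] = [];
  const name = f.column.split(".").pop() ?? f.column;
  if (OPAQUE_NAMES.test(name)) reasons.push("opaque-name");
  if (f.comment !== undefined && f.comment.trim() !== "") reasons.push("comment");
  if (f.viaRelation !== undefined) reasons.push("relation");
  return reasons;
}

// Builds one short prompt per column; only schema metadata is sent, never data.
function buildPrompt(f: Finding): string {
  const lines = [
    "Classify this database column as one of: None, PII, Sensitive, Secret.",
    `Column: ${f.column} (${f.type})`,
  ];
  if (f.comment) lines.push(`Comment: ${f.comment}`);
  if (f.viaRelation) lines.push(`Label inherited from: ${f.viaRelation}`);
  lines.push("Answer with the label only.");
  return lines.join("\n");
}
```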

If you want to reproduce the benchmark, the labelled dataset lives in the public dockier/pii-benchmark repo. PRs welcome — especially for Ruby/Active Record models, which we under-tested.