OpenAI Privacy Filter

OpenAI Privacy Filter is an open-weight, Apache 2.0-licensed token-classification model published by OpenAI on HuggingFace for detecting and masking personally identifiable information (PII) in text. It is designed for high-throughput data sanitization workflows: cleaning text before it reaches an LLM, redacting datasets, and on-prem privacy filtering. It is not a compliance guarantee; it is one layer in a broader privacy-by-design stack.

This is a notable release because OpenAI rarely ships open weights, and because a privacy filter is one of the highest-leverage components to put in front of any LLM pipeline that touches user data, particularly for anyone running BYOK or local inference setups (see Where Your AI Prompts Really Go - A Practical Guide to AI Privacy (Article)).

Specs

| Aspect | Detail |
| --- | --- |
| Developer | OpenAI |
| License | Apache 2.0 (permissive) |
| Total params | 1.5B |
| Active params | 50M (sparse MoE, 128 experts, top-4) |
| Architecture | 8 transformer blocks, grouped-query attention, sparse MoE, single-pass |
| Context window | 128k tokens |
| Decoding | Constrained Viterbi over BIOES span labels |
| Distribution | HuggingFace (openai/privacy-filter), runs via Transformers or Transformers.js (browser) |

What it detects

Eight PII span categories:

  1. Account number
  2. Private address
  3. Private email
  4. Private person (name)
  5. Private phone
  6. Private URL
  7. Private date
  8. Secret

The model labels all tokens in a single pass (not autoregressive), then runs constrained Viterbi to produce coherent BIOES spans. A runtime threshold tunes the precision/recall tradeoff per use case.
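
To make the decoding step concrete, here is a minimal sketch of constrained Viterbi over BIOES labels in plain Python. It is a generic textbook version, not OpenAI's implementation: the function names, the `private_person` label string, and the per-token log-scores are all assumptions made for illustration. The point it shows is that a per-token argmax can produce an invalid sequence (for example B, O, E), while the constrained search returns the best-scoring sequence that forms well-formed spans.

```python
NEG_INF = float("-inf")

def build_labels(categories):
    """Build a BIOES label list: O plus B-/I-/E-/S- for each category."""
    labels = ["O"]
    for cat in categories:
        labels += [f"{prefix}-{cat}" for prefix in "BIES"]
    return labels

def allowed(prev, curr):
    """Return True if BIOES label `curr` may directly follow `prev`."""
    prev_tag, _, prev_cat = prev.partition("-")
    curr_tag, _, curr_cat = curr.partition("-")
    if prev_tag in ("B", "I"):
        # A span is open: it must continue (I) or close (E) in the same category.
        return curr_tag in ("I", "E") and curr_cat == prev_cat
    # prev is O, E or S: no span is open, so only O or a new span may follow.
    return curr_tag in ("O", "B", "S")

def viterbi_bioes(scores, labels):
    """Return the best valid label path.

    scores[t][j] is the log-score of labels[j] at token t (non-empty input assumed).
    """
    n = len(labels)
    can_start = [lab[0] in "OBS" for lab in labels]
    can_end = [lab[0] in "OES" for lab in labels]
    best = [scores[0][j] if can_start[j] else NEG_INF for j in range(n)]
    back = []  # back[t-1][j] = best predecessor of label j at token t
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for j in range(n):
            prev_score, prev_i = max(
                (best[i], i) for i in range(n) if allowed(labels[i], labels[j])
            )
            new_best.append(prev_score + scores[t][j])
            pointers.append(prev_i)
        best = new_best
        back.append(pointers)
    # The sequence may not end with a span still open.
    last = max((best[j], j) for j in range(n) if can_end[j])[1]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [labels[j] for j in reversed(path)]

# Hypothetical scores for three tokens; column order follows `labels`: O, B, I, E, S.
labels = build_labels(["private_person"])
scores = [
    [-2.0, -0.2, -5.0, -5.0, -3.0],
    [-0.6, -5.0, -0.9, -4.0, -5.0],  # argmax here is O, which would break the span
    [-2.5, -5.0, -5.0, -0.2, -2.0],
]
decoded = viterbi_bioes(scores, labels)
print(decoded)
```

In this example the middle token scores highest as O on its own, but the decoder picks I because B, O, E is not a legal BIOES sequence. The runtime threshold mentioned above would act on the scores or the resulting spans; how the released model applies it is not covered by this sketch.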

Why it matters

  • Open weights from OpenAI are rare. The release also lowers the bar for self-hosted, on-prem privacy filtering.
  • Right place in the stack. PII detection belongs in front of any third-party LLM call, especially for cloud AI APIs that retain inputs by default. See the four data paths in Where Your AI Prompts Really Go - A Practical Guide to AI Privacy (Article).
  • Small, efficient. 50M active params + 128k context means it can sit in front of a request without dominating latency or cost.
  • Fine-tunable. Domains with their own privacy taxonomies (medical, legal, HR) can extend the label set.

Limitations

  • Not a compliance guarantee — should be one layer in a broader privacy-by-design approach.
  • Performance drops on non-English text, non-Latin scripts, and underrepresented naming patterns.
  • Failure modes include under-detection of uncommon names and over-redaction of public entities.
  • High-sensitivity settings (medical, legal, financial, HR) still require human review.

Use cases

| Use case | How the filter fits |
| --- | --- |
| Pre-prompt sanitization | Strip PII before sending prompts to a third-party LLM API |
| Dataset cleaning | Redact PII from training, fine-tuning, or analytics datasets |
| On-prem privacy gateway | Filter inbound/outbound text in a self-hosted AI pipeline |
| Browser-side filtering | Run via Transformers.js to redact before data ever leaves the device |
| Custom redaction policies | Fine-tune for domain-specific PII (medical IDs, legal refs) |
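
As an illustration of the pre-prompt sanitization case, the sketch below replaces detected spans with placeholders before the text is sent anywhere. It assumes the detector returns character-offset spans as dicts with `start`, `end`, `entity_group`, and `score` keys, which is the shape the Hugging Face Transformers token-classification pipeline produces when an aggregation strategy is set. Whether openai/privacy-filter emits exactly these keys and label strings is an assumption, and the `redact` helper, the `min_score` parameter, and the sample spans are hypothetical.

```python
def redact(text, spans, min_score=0.5):
    """Replace each sufficiently confident span with a [LABEL] placeholder.

    Assumes spans do not overlap. `min_score` plays the role of the runtime
    threshold: raise it for higher precision, lower it for higher recall.
    """
    kept = [s for s in spans if s["score"] >= min_score]
    # Work right to left so earlier character offsets stay valid after each replacement.
    for span in sorted(kept, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['entity_group'].upper()}]"
        text = text[: span["start"]] + placeholder + text[span["end"] :]
    return text

def make_span(text, value, label, score):
    """Helper for the example: build a span dict by locating `value` in `text`."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "entity_group": label, "score": score}

# Hand-written spans standing in for model output (not produced by the real model).
prompt = "Email Jane Doe at jane@example.com today."
spans = [
    make_span(prompt, "Jane Doe", "private_person", 0.98),
    make_span(prompt, "jane@example.com", "private_email", 0.40),
]

strict = redact(prompt, spans, min_score=0.5)   # drops the low-confidence email span
lenient = redact(prompt, spans, min_score=0.3)  # keeps both spans
print(strict)
print(lenient)
```

The two calls show the precision/recall tradeoff in miniature: with the stricter threshold the low-confidence email address passes through unredacted, which is the kind of miss that human review is meant to catch in high-sensitivity settings.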

About Sébastien

I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.

I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.