PII Masking Patterns for Customer-Facing Chatbots

The hard truth about PII in chatbots
You cannot stop users from sharing personal data. Warnings do not work. Terms of service do not work. Users will paste credit card numbers, medical records, and social security numbers into your chat widget because the input box is right there and they want help.
In our production data, 23% of customer service conversations contain at least one PII entity. In 67% of those conversations, the user volunteered the PII without prompting; in the remaining 33%, it appeared only after the chatbot asked for identifying information to process a request.
The question is not whether PII will enter your system. The question is what happens to it when it does.
Most teams approach this wrong. They either strip PII entirely and break the conversation, or they pass everything through and hope their vendor agreement covers them. Neither works at scale. Strip too aggressively and your bot becomes useless for anything involving user data. Pass everything through and you inherit every compliance obligation your LLM vendor triggers.
Why vendor retention policies create compliance surface area
LLM vendors retain data for safety monitoring, abuse detection, and model improvement. The specifics vary and they change frequently:
| Vendor | Default Retention | ZDR Available | Notes |
|---|---|---|---|
| OpenAI API | 30 days | Yes (eligible endpoints) | Enterprise gets workspace controls |
| Anthropic Claude | 30 days | Yes | Policy-flagged content may be retained up to 2 years |
| Google Vertex AI | No training by default | N/A | Logging policies vary by endpoint |
| AWS Bedrock | No retention by default | N/A | CloudWatch logging often enabled separately |
This retention creates compliance surface area. Under GDPR, data transfers outside the EU require adequacy decisions or Standard Contractual Clauses. Under Singapore's PDPA, the Transfer Limitation Obligation requires comparable protection for overseas recipients. Under CCPA, you need to disclose what you collect and who you share it with.
Our position: The only way to stay compliant by default is to prevent PII from leaving your perimeter in the first place. Everything else is a mitigation, not a solution.
The seven patterns compared
We have tested these patterns across healthcare, financial services, and e-commerce deployments. Each has tradeoffs. Pick based on your constraints, not your preferences.
| Pattern | Latency | Context Loss | Best For |
|---|---|---|---|
| Hard strip | ~2ms | Total | Anonymous FAQs only |
| Typed placeholder | ~8ms | None | Most use cases (default choice) |
| Two-stage detection | 3-50ms | None | High-volume, latency-sensitive |
| Context-aware classification | ~25ms | Selective | Regulated industries with fine-grained rules |
| Token vaulting | ~15ms | Partial | Healthcare, finance (PII never leaves infra) |
| Bidirectional masking | ~12ms | None | Multi-turn conversations |
| Differential privacy | ~5ms | Aggregated | Analytics and reporting |
Pattern 1: Hard strip
Remove PII entirely before the message reaches the LLM. The simplest approach, and often the worst.
When to use: Only when PII is never relevant to the task and you need maximum simplicity. Think anonymous FAQ bots where users should not share personal data at all.
Problems: Destroys context. The model sees broken sentences and cannot reason about the missing information. If a user asks "send the confirmation to my email," the model has no idea what email they mean. Resolution rate drops 40% compared to placeholder masking.
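For illustration, a minimal Python sketch of hard stripping. The regex patterns are illustrative stand-ins; a production detector would use NER rather than regex alone:

```python
import re

# Illustrative patterns only -- real systems pair regex with an NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def hard_strip(text: str) -> str:
    """Delete PII spans outright, leaving whatever sentence fragments remain."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("", text)
    return text

print(hard_strip("Send the confirmation to jane@example.com"))
# -> "Send the confirmation to "  <- the model can no longer resolve the request
```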
Pattern 2: Typed placeholder masking
Replace PII with structured tokens that preserve semantic meaning. This is the sweet spot for most use cases.
The approach: detect sensitive spans like emails, phone numbers, and names, then replace them with typed placeholders such as <EMAIL_1>, <PHONE_1>, <PERSON_1>. Store the mapping server-side. After the LLM responds, restore the original values.
```
// Example transformation
Input:  "My email is jane@example.com and my SSN is 123-45-6789"
Masked: "My email is <EMAIL_1> and my SSN is <SSN_1>"
Map:    { EMAIL_1: "jane@example.com", SSN_1: "123-45-6789" }

// LLM sees only the masked version
// Response: "I've noted <EMAIL_1> for your account"
// Final:    "I've noted jane@example.com for your account"
```
When to use: Default choice for customer service, support, and general Q&A. Preserves conversational flow while removing actual sensitive data from vendor calls.
Critical detail: You must add instructions to your system prompt telling the model to preserve tokens exactly as written. Without this, models will sometimes expand or modify the placeholders. Add something like: "Preserve all bracketed tokens (e.g., <EMAIL_1>) exactly as written. Do not expand, explain, or modify them."
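As a concrete reference, here is a minimal Python sketch of the mask/unmask round trip. The regex detectors stand in for a real NER model, and the function names are ours, not a library API:

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each PII span with a typed placeholder; return text + mapping."""
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}

    def replacer(entity_type: str):
        def _sub(match: re.Match) -> str:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            token = f"<{entity_type}_{counters[entity_type]}>"
            mapping[token] = match.group()
            return token
        return _sub

    for entity_type, pattern in PII_PATTERNS.items():
        text = pattern.sub(replacer(entity_type), text)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values after the LLM responds (server-side only)."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

masked, mapping = mask("My email is jane@example.com and my SSN is 123-45-6789")
# masked -> "My email is <EMAIL_1> and my SSN is <SSN_1>"
print(unmask("I've noted <EMAIL_1> for your account", mapping))
# -> "I've noted jane@example.com for your account"
```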
Pattern 3: Two-stage detection with fast path
Most messages contain no PII. Running full NER on every message wastes compute and adds latency. A two-stage approach uses a cheap boolean classifier first.
Stage 1: Fast boolean check (under 5ms). Does this message likely contain PII? Uses lightweight pattern matching and a small classifier. If no, skip to LLM immediately.
Stage 2: Full NER extraction (50-100ms). Only runs when PII is detected. Extract and mask all sensitive entities using a larger model.
When to use: High-volume systems where latency matters. The fast path handles "hi", "thanks", and most simple queries without invoking the heavy extractor.
Our numbers: The fast detector adds 3ms p50. The full extractor adds 47ms p50. With 94% of messages taking the fast path (no PII detected), average overhead is 5.6ms. Compare to 47ms if you run full extraction on everything.
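A minimal sketch of the gate, assuming a regex trigger set for stage 1 and leaving the stage-2 extractor as a stub; in practice, stage 1 would also include a small trained classifier:

```python
import re

# Stage 1 triggers: cheap signals that PII may be present. Illustrative only.
FAST_TRIGGERS = re.compile(
    r"@|\d{3}[-.\s]?\d{2,4}|\b(ssn|email|phone|address|card)\b", re.IGNORECASE
)

def likely_contains_pii(text: str) -> bool:
    """Stage 1: a single scan over the message, well under 5ms."""
    return bool(FAST_TRIGGERS.search(text))

def run_full_ner_and_mask(text: str) -> str:
    """Stage 2 stub: the heavy NER extractor plus pattern-2 masking."""
    raise NotImplementedError

def preprocess(text: str) -> str:
    if not likely_contains_pii(text):
        return text                      # fast path: straight to the LLM
    return run_full_ner_and_mask(text)   # slow path: full extraction
```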
Pattern 4: Context-aware entity classification
Not all entities are equally sensitive. A first name in a greeting is different from a first name in a medical record. Context-aware classification lets you mask selectively.
The classifier considers:
- Entity type (SSN is always sensitive, first name varies)
- Surrounding context (medical terms nearby elevate sensitivity)
- Conversation topic (benefits discussion vs. general inquiry)
- Regulatory domain configured for the chatbot
When to use: When you need fine-grained control and can tolerate complexity. Useful in regulated industries where some data requires stronger protection than others.
Tradeoff: More configuration, more edge cases, more testing. Only worth it if uniform masking is too aggressive for your use case.
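A sketch of what the decision logic can look like, with an illustrative rule table; the escalator terms, window size, and domain rules are assumptions you would tune per deployment:

```python
from dataclasses import dataclass

ALWAYS_SENSITIVE = {"SSN", "CREDIT_CARD", "MEDICAL_RECORD_NUMBER"}  # type alone decides
CONTEXT_ESCALATORS = {"diagnosis", "prescription", "claim", "benefits"}

@dataclass
class Entity:
    type: str
    text: str
    start: int
    end: int

def should_mask(entity: Entity, message: str, domain: str) -> bool:
    """Combine entity type, surrounding context, and configured domain."""
    if entity.type in ALWAYS_SENSITIVE:
        return True
    # Look at a +/-50 character window around the entity for escalating terms.
    window = message[max(0, entity.start - 50):entity.end + 50].lower()
    if any(term in window for term in CONTEXT_ESCALATORS):
        return True
    # Example domain rule: healthcare masks person names; general Q&A does not.
    return domain == "healthcare" and entity.type == "PERSON"
```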
Pattern 5: Token vaulting with reference IDs
For high-security environments, store PII in a separate vault and pass only reference IDs through the LLM pipeline.
The LLM only ever sees something like "Contact the customer at ref:a1b2c3d4". The actual PII lives in your encrypted vault and never leaves your infrastructure.
When to use: Healthcare, financial services, or any environment where PII must never touch third-party infrastructure, even in masked form.
Architecture notes:
- The vault must be regionally deployed to satisfy data residency requirements
- Use a separate encryption key per tenant for multi-tenant systems
- Set TTLs on vault entries (we default to 24 hours)
- Log access patterns for audit, not actual values
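A toy sketch of the vault interface, using an in-memory dict where a real deployment would use an encrypted, regionally pinned store with per-tenant keys; the function names are illustrative:

```python
import secrets
import time

_VAULT: dict[str, tuple[str, float]] = {}   # stand-in for the encrypted store
TTL_SECONDS = 24 * 3600                     # the 24-hour default noted above

def vault_store(value: str) -> str:
    """Store PII; return the opaque reference ID the LLM pipeline sees."""
    ref = f"ref:{secrets.token_hex(4)}"
    _VAULT[ref] = (value, time.time() + TTL_SECONDS)
    return ref

def vault_resolve(ref: str) -> str | None:
    """Resolve a reference, honoring the TTL. Access should be audit-logged."""
    entry = _VAULT.get(ref)
    if entry is None or time.time() > entry[1]:
        _VAULT.pop(ref, None)
        return None
    return entry[0]

ref = vault_store("jane@example.com")
print(f"Contact the customer at {ref}")   # e.g. "Contact the customer at ref:a1b2c3d4"
```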
Pattern 6: Bidirectional masking with session context
In multi-turn conversations, you need consistent masking across the entire session. The same email should get the same placeholder every time it appears.
This pattern maintains a session-scoped entity registry. When PII is detected, the system checks if it matches a previously seen entity. If yes, reuse the same placeholder. If no, assign a new one.
When to use: Any multi-turn conversation where users reference the same PII multiple times. Essential for natural dialogue.
Example flow:
```
Turn 1: "Email me at jane@example.com"      → <EMAIL_1>
Turn 3: "Actually use jane@example.com"     → <EMAIL_1> (same reference)
Turn 5: "CC my work email bob@company.com"  → <EMAIL_2> (new entity)
```
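A minimal sketch of the session registry (our own illustration, not a library API):

```python
class SessionEntityRegistry:
    """Session-scoped mapping: the same value always gets the same placeholder."""

    def __init__(self) -> None:
        self._value_to_token: dict[str, str] = {}
        self._token_to_value: dict[str, str] = {}
        self._counters: dict[str, int] = {}

    def token_for(self, entity_type: str, value: str) -> str:
        if value in self._value_to_token:         # previously seen: reuse token
            return self._value_to_token[value]
        self._counters[entity_type] = self._counters.get(entity_type, 0) + 1
        token = f"<{entity_type}_{self._counters[entity_type]}>"
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def mapping(self) -> dict[str, str]:
        """Token -> value map, used server-side to unmask responses."""
        return dict(self._token_to_value)

registry = SessionEntityRegistry()
registry.token_for("EMAIL", "jane@example.com")   # <EMAIL_1>  (turn 1)
registry.token_for("EMAIL", "jane@example.com")   # <EMAIL_1>  (turn 3, reused)
registry.token_for("EMAIL", "bob@company.com")    # <EMAIL_2>  (turn 5, new)
```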
Pattern 7: Differential privacy for analytics
When you need to analyze conversation patterns without exposing individual PII, apply differential privacy to aggregated metrics.
This pattern adds calibrated noise to aggregate statistics. You can report "23% of conversations mention email addresses" without storing which conversations or which addresses.
When to use: Analytics dashboards, A/B testing, and any reporting where you need aggregate insights without individual exposure.
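A sketch of the Laplace mechanism for a noisy count, with illustrative numbers. Epsilon controls the privacy/accuracy tradeoff, and the sensitivity here is 1 because one conversation changes a count by at most 1:

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Add Laplace(0, 1/epsilon) noise: the difference of two exponentials."""
    return true_count + random.expovariate(epsilon) - random.expovariate(epsilon)

# Report an aggregate without storing which conversations contained emails.
noisy = dp_count(230, epsilon=0.5)   # illustrative: 230 of 1,000 conversations
print(f"~{noisy / 1000:.0%} of conversations mention email addresses")
```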
The HoverBot architecture
We combine patterns 2, 3, and 6 into a unified pipeline:
- Fast gate: Boolean PII detector runs on every message (3ms p50). No PII detected? Skip to LLM.
- Full extraction: NER model identifies and classifies entities (47ms p50). Only runs when gate fires.
- Session-aware masking: Consistent placeholders across the conversation using session registry.
- LLM call: Masked text goes to the model with placeholder preservation instructions.
- Server-side unmask: Placeholders restored to original values before user sees the response.
The mapping never leaves our infrastructure. We pin processing to regional deployments for data residency. Where we must call external LLMs, we enable vendor ZDR options when available.
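Sketched end to end, assuming the fast gate, extractor, and LLM client as hypothetical stand-ins and reusing the SessionEntityRegistry from the pattern 6 sketch:

```python
def handle_message(registry: "SessionEntityRegistry", user_text: str) -> str:
    """Gate -> extract -> mask -> LLM -> unmask. fast_gate, extract_entities,
    and call_llm are stand-ins for the components described above."""
    if not fast_gate(user_text):                 # 1. boolean PII detector
        return call_llm(user_text)               #    no PII: straight through
    entities = extract_entities(user_text)       # 2. full NER extraction
    masked = user_text
    # Replace right-to-left so earlier span offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        token = registry.token_for(ent.type, ent.text)   # 3. session-aware mask
        masked = masked[:ent.start] + token + masked[ent.end:]
    reply = call_llm(masked)                     # 4. masked text to the model
    for token, value in registry.mapping().items():
        reply = reply.replace(token, value)      # 5. server-side unmask
    return reply
```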
Detection accuracy: what actually matters
Most PII handling advice focuses on detection accuracy. That is the wrong optimization target. In production, you care about impact, not precision scores.
| Error Type | Impact with ZDR | Impact without ZDR | Mitigation |
|---|---|---|---|
| False negative (PII missed) | Compliance note | Potential breach | Enable ZDR, tune recall |
| False positive (non-PII masked) | Minor UX issue | Minor UX issue | Tune precision, add exceptions |
| Entity type mismatch | Wrong placeholder | Wrong placeholder | Usually harmless, monitor |
Our recommendation: Tune for the failure mode that hurts most. If you have ZDR enabled, optimize for precision (reduce false positives that annoy users). If you cannot enable ZDR, optimize for recall (catch everything, accept some over-masking).
Implementation checklist
- ☐ Choose masking pattern based on compliance requirements
- ☐ Deploy PII detection model in same region as user data
- ☐ Add placeholder preservation instructions to system prompts
- ☐ Implement session-scoped mapping for multi-turn conversations
- ☐ Set TTL on all PII mappings (24 hours is a reasonable default)
- ☐ Log masking decisions for audit without logging actual PII values
- ☐ Enable vendor ZDR where available
- ☐ Test with adversarial inputs (encoded PII, partial matches, edge cases)
- ☐ Monitor false positive rate and adjust thresholds based on user feedback
- ☐ Document your pattern choice and the reasoning for auditors
The opinionated take
The industry treats PII protection as a compliance checkbox. Run a scanner, check the box, move on. This is backwards.
PII protection is an architectural decision that shapes your entire chatbot pipeline. The pattern you choose affects latency, conversation quality, debugging complexity, and incident response. Picking the wrong pattern creates problems you will live with for years.
Three principles we have learned the hard way:
- Default to typed placeholders. Hard stripping sounds safe but destroys conversations. Placeholder masking preserves context with minimal overhead.
- Build for multi-turn from day one. Single-message masking breaks when the same user mentions their email in turn 1 and references "my email" in turn 5. Session-aware masking is not optional.
- Vendor ZDR is not a substitute for client-side protection. ZDR reduces your exposure but does not eliminate it. The safest PII is PII that never leaves your perimeter.
A privacy banner is not a privacy control. If you are shipping a customer-facing chatbot, build the PII layer from day one. The patterns exist. The tools work. The only question is whether you implement them before or after your first incident.