Guardrails that are not just regex.
"We'll add a regex for that" is how most LLM safety work starts, and how most of it ends. The attacks have moved on; the defences have to move with them. Here is what actually works when you ship LLMs to the public.
The safety layer of an LLM product is not a filter. It is a pipeline of detectors, policies and fallbacks — layered so that no single component is load-bearing. If any one piece fails, the system degrades politely. If the whole pipeline fails, the system refuses to answer.
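The "no single component is load-bearing" idea can be sketched as a chain of detectors where a crashed detector counts as a block, never a pass. This is a minimal illustration, not a production design; the detector functions here are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def run_pipeline(text: str, detectors: list[Callable[[str], Verdict]]) -> Verdict:
    """Run detectors in order; fail closed on the first flag or error."""
    for detect in detectors:
        try:
            verdict = detect(text)
        except Exception:
            # A crashed detector is treated as a block, not a pass:
            # no single component is load-bearing.
            return Verdict(False, "detector error")
        if not verdict.allowed:
            return verdict
    return Verdict(True)

# Illustrative detectors (assumptions, not real classifiers)
def max_length(text: str) -> Verdict:
    return Verdict(len(text) < 4000, "input too long")

def no_override(text: str) -> Verdict:
    return Verdict("ignore previous instructions" not in text.lower(),
                   "override-instruction pattern")

print(run_pipeline("Hello", [max_length, no_override]).allowed)  # True
```

Note the asymmetry: a healthy detector can return either verdict, but an unhealthy one can only block. That is what "degrades politely" means in code.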
The input side
Users will try to jailbreak you. The question is whether the first jailbreak that works is the one that leaks a customer record. Three layers, in order:
1. Prompt-injection detection — a small classifier that flags typical patterns (override-instruction-style attacks, role hijacks, encoded instructions). Not perfect. Stops the cheap ones.
2. PII redaction — if your product does not need user-side PII, strip it with named-entity recognition before it reaches the model. This reduces the blast radius of every other failure.
3. Intent classification — is this query in-scope for the product? Out-of-scope queries go to a canned refusal, not the LLM. A lot of jailbreaks are just out-of-scope queries dressed up.
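The redaction layer above can be illustrated in a few lines. A real system would use an NER model; the two regex patterns here are deliberately simple stand-ins for the easy cases (emails, phone numbers), included only to show the replace-with-label shape.

```python
import re

# Illustrative patterns only -- production redaction uses NER models.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Mail me at jo@example.com"))  # Mail me at [EMAIL]
```

Typed placeholders (`[EMAIL]`, not `***`) matter: the model can still reason about the redacted text, and downstream logs stay debuggable without storing the PII itself.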
The output side
The model will produce outputs it should not. Your job is to catch them before the user sees them.
- Critique pass — a second, cheaper LLM reviews the output against your policy. Does it leak PII? Does it violate the scope? Does it cite a source that does not support the claim?
- Content policy checks — toxicity, self-harm, adult content, brand-specific rules. Use the vendor's moderation API plus a tuned classifier you own.
- Grounding check — every factual claim maps to a source in the retrieved context. No source, no claim.
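The grounding check is the easiest of the three to sketch. The version below is deliberately naive (an assumption: a claim sharing too few content words with the retrieved context is flagged as unsupported); real systems use an entailment or NLI model, but the control flow is the same: no source, no claim.

```python
def content_words(text: str) -> set[str]:
    """Lowercased words, minus a tiny illustrative stopword list."""
    stop = {"the", "a", "an", "is", "are", "was", "in", "of", "to", "and"}
    return {w for w in text.lower().split() if w.isalpha() and w not in stop}

def ungrounded_claims(claims: list[str], context: str) -> list[str]:
    """Flag claims whose content words barely overlap the retrieved context."""
    ctx = content_words(context)
    return [c for c in claims
            if len(content_words(c) & ctx) < max(1, len(content_words(c)) // 2)]

context = "The refund window is 30 days from the delivery date."
claims = ["The refund window is 30 days.", "Refunds are instant."]
print(ungrounded_claims(claims, context))  # ['Refunds are instant.']
```

Anything this function returns never reaches the user: the flagged claim is dropped, or the whole response is regenerated with the check in the loop.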
Policies, not rules
Write your safety policies as a document a human can read and version. Your detectors implement the policy. When the policy changes, detectors update — once, traceably. Teams that skip the policy document end up with a graveyard of one-off regexes nobody remembers the purpose of.
Fail closed, communicate warmly
When the pipeline flags a response, the user gets a clear, branded refusal — not a system error, not a shrug. "We cannot answer that one — try rephrasing, or email us if it is urgent." Refusals are a design surface. Treat them like one.
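Fail-closed at the response boundary can be expressed as one wrapper: every flagged or crashed path collapses into the same branded refusal, never a stack trace. The `generate` and `output_checks` parameters below are hypothetical stand-ins for the model call and the output-side checks.

```python
from typing import Callable

REFUSAL = ("We cannot answer that one - try rephrasing, "
           "or email us if it is urgent.")

def safe_answer(question: str,
                generate: Callable[[str], str],
                output_checks: list[Callable[[str], bool]]) -> str:
    """Fail closed: anything flagged or crashing becomes a branded refusal."""
    try:
        draft = generate(question)
        if all(check(draft) for check in output_checks):
            return draft
    except Exception:
        pass  # never surface a system error to the user
    return REFUSAL
```

The point of the single `REFUSAL` constant is the design-surface argument: there is exactly one place to make the refusal copy warm, on-brand, and actionable.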
“Security for LLMs is the same work it has always been: defence in depth, honest failure modes, and not pretending a single regex is a control.”
- Safety
- Security
- Guardrails
- LLM