by Tomasz Bartel

vge-promptguard-v2h: the end of fine-tuning for production guardrails

LLM Security · Prompt Injection · Guardrails · Catastrophic Forgetting · AIDR

Starting point

vge-promptguard-v1g is our production prompt injection detector: a DeBERTa-based encoder that performs well on classical attacks (jailbreaks, instruction override, role-play attacks).

Over the last few months, two input types that v1g had never seen during training started dominating production traffic. First, agent tool outputs: tool call results, logs, JSON diffs, MCP server responses. v1g flagged them as attacks because, on the surface, they look like instructions. Second, Polish meta-conversations about security: user questions like "how does prompt injection work", educational content, fragments of our own documentation. v1g flagged these too.

On top of that, new attack vectors appeared (ASCII smuggling, indirect injection via RAG, multi-turn priming) that v1g had no exposure to.

First plan: fine-tune v1g on the new data. After six weeks we abandoned it.


Why fine-tuning v1g failed

We worked through the standard catastrophic-forgetting toolbox: LR schedules (cosine, warmup-decay, layer-wise LR decay); R-Drop and EWC regularization; knowledge distillation with v1g as teacher (KL on logits); LoRA with a frozen base (ranks 8, 16, 32); weight averaging across seeds (SWA, model soups over 5 and 10 runs); linear interpolation between v1g and the fine-tuned weights; and targeted oversampling of v1g's weak classes.

Across configurations the same pattern showed up: a 4 to 11 point F1 gain on the new distribution came with a 2 to 7 point F1 loss on the old one. The best model soups reduced regression but never to zero.

The reason is geometric. The optimum for the old distribution (v1g) and the optimum for the new distribution sit in different loss basins. A weighted average of weights from two basins lands in the valley between them, not in either basin. Model soups help at the margin; they do not solve the problem.
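The basin picture can be made concrete with a toy one-dimensional double-well loss (a simplified illustration, not a model of DeBERTa's actual loss surface): two optima, and their average sits on the barrier between them.

```python
# Toy double-well loss with basins at w = -1 and w = +1.
# Purely illustrative: stands in for "old-distribution optimum"
# and "new-distribution optimum" in a single scalar weight.
def loss(w: float) -> float:
    return (w * w - 1.0) ** 2

w_old = -1.0                      # optimum for the old distribution
w_new = 1.0                       # optimum for the new distribution
w_soup = 0.5 * (w_old + w_new)    # naive weight average of the two

print(loss(w_old), loss(w_new), loss(w_soup))  # → 0.0 0.0 1.0
```

Both endpoints sit at zero loss, while their average lands at the top of the barrier between the basins: exactly the failure mode seen with interpolation and soups.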

Operational conclusion: a single DeBERTa weight set, trained on both distributions at once, does not preserve v1g's full competence.

vge-promptguard-v2h architecture

v2h is two models plus one decision router.

v1g (base). Untouched: the production model, preserved as the authority on classical attacks.

h_model (support). Same encoder as v1g, fine-tuned with LoRA rank 16 on the new distribution (agent outputs, Polish meta-conversations, new attack vectors). Stabilized by averaging 5 independent SWA runs. It does not replace v1g; it extends its field of view.

Router. Plain code, no gradient-trained parameters. Inputs: v1g confidence, h_model confidence, raw input text. Output: final label (attack or benign).

Router rule

The router has three branches.

1. Default to v1g. If none of the conditions below fire, the router returns v1g's decision. This is the default for most traffic and the reason v2h does not regress on the old distribution.

2. h_model override toward benign. Fires when h_model confidence_benign > 0.92 and v1g confidence_attack < 0.85. The first threshold prevents h_model from overriding v1g on a weak signal. The second threshold prevents h_model from ever drowning out a high-confidence v1g detection.

3. h_model add attack. Fires when h_model confidence_attack > 0.80 and v1g returned benign. Lower threshold than override, because added vigilance costs false positives, not missed attacks.

In the middle zone (h_model without a strong signal) h_model's vote does not enter the decision.

Practical consequence: on per-decision audit you can see exactly which model made the call and why. Two scores, one rule, no hidden meta-layer.
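The three branches can be sketched in a few lines of plain code. The thresholds come from the post; the function and field names are illustrative, not the production API. Returning a reason string alongside the label is what makes each decision auditable.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    attack: float   # model's confidence that the input is an attack
    benign: float   # model's confidence that the input is benign

def route(v1g: Scores, h: Scores) -> tuple[str, str]:
    """Return (label, reason): two scores, one rule, no hidden meta-layer."""
    # Branch 2: h_model override toward benign. The 0.92 floor stops
    # overrides on weak signals; the 0.85 cap stops h_model from ever
    # drowning out a high-confidence v1g detection.
    if h.benign > 0.92 and v1g.attack < 0.85:
        return "benign", "h_model override"

    v1g_label = "attack" if v1g.attack >= 0.5 else "benign"

    # Branch 3: h_model adds an attack v1g missed. Lower threshold,
    # since extra vigilance costs false positives, not missed attacks.
    if h.attack > 0.80 and v1g_label == "benign":
        return "attack", "h_model add"

    # Branch 1: default to v1g. In the middle zone, h_model's vote
    # simply never enters the decision.
    return v1g_label, "v1g default"
```

For example, `route(Scores(attack=0.70, benign=0.30), Scores(attack=0.05, benign=0.95))` returns `("benign", "h_model override")`: v1g leaned attack but below 0.85, and h_model was confidently benign.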

Second layer: pattern-aware calibration

After the first production week we spotted a specific input class where v1g returned confidence_attack > 0.95 but the inputs were genuine agent tool outputs: JSON responses from MCP servers and Python traceback fragments. Lowering the override threshold globally would handle this case, but would also let through real attacks formatted as pseudo-JSON.

Solution: a small regex/heuristic pattern detector (35 lines of code) that recognizes clear tool-output signatures (JSON with known keys, HTTP headers, Python tracebacks). When the pattern matches, the router uses thresholds { override 0.85, add 0.70 } instead of the defaults { 0.92, 0.80 }.

The pattern detector does not look at semantics, only at surface features. Every rule in it can be shown to a client during audit. By design it does not cover free-form text where an attack could be hidden contextually, so the threshold relaxation applies only to a narrow, unambiguously recognizable slice of traffic.
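A minimal sketch of this layer, assuming the post's threshold pairs; the concrete regexes and "known keys" below are illustrative stand-ins for the production rules, which are not published:

```python
import re

# Surface-level signatures of genuine tool outputs. Each rule is
# auditable: it matches structure, never semantics.
TOOL_OUTPUT_PATTERNS = [
    # JSON object opening with a key typical of MCP / tool-call responses
    # (key list is a hypothetical example)
    re.compile(r'^\s*\{\s*"(jsonrpc|result|tool_call_id|status)"'),
    # HTTP response status line
    re.compile(r'^HTTP/\d\.\d \d{3}', re.M),
    # Python traceback header
    re.compile(r'Traceback \(most recent call last\):'),
]

DEFAULT_THRESHOLDS = {"override": 0.92, "add": 0.80}
RELAXED_THRESHOLDS = {"override": 0.85, "add": 0.70}

def thresholds_for(text: str) -> dict:
    """Relax router thresholds only for unambiguous tool-output signatures."""
    if any(p.search(text) for p in TOOL_OUTPUT_PATTERNS):
        return RELAXED_THRESHOLDS
    return DEFAULT_THRESHOLDS
```

Free-form prose matches none of the patterns and keeps the strict defaults, so the relaxation stays confined to the narrow slice of traffic the rules can recognize unambiguously.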

Results

Measured on a held-out test set combining the old v1g distribution (12k samples) and the new distribution (8k samples).

v2h costs two encoder passes instead of one. The pattern detector adds under 1 ms.

| Metric              | v1g   | v1g fine-tuned | v2h   |
|---------------------|-------|----------------|-------|
| F1, old distribution | 0.947 | 0.891          | 0.946 |
| F1, new distribution | 0.612 | 0.864          | 0.873 |
| Latency p50         | 18 ms | 18 ms          | 34 ms |
| Latency p99         | 41 ms | 41 ms          | 78 ms |

What this means for the roadmap

The architecture is extensible by design. Any new attack surface, new language, or new tooling environment can be handled by adding another expert and another router branch, without touching v1g and without regression risk on the old distribution. This is the direction we will take vge-promptguard in upcoming versions.

Availability

vge-promptguard-v2h will not be released on Hugging Face for now. Our v1g model has drawn visible interest from several competing teams there, and v2h contains substantially more of our IP: the router construction, the threshold calibration, the pattern detector, the h_model training data. We are not going to make it easy to copy the new architecture quickly.

v2h will ship in the next Vigil Guard Enterprise 1.7.x release, available soon.