Published on April 12, 2026

VGE PromptGuard v1g: Open-Weight Prompt Injection Detection Model

What Is VGE PromptGuard v1g

VGE PromptGuard v1g is the core detection engine running inside Vigil Guard Enterprise. It classifies text inputs as SAFE or INJECTION in real time, protecting LLM-powered applications from direct attacks, jailbreaks, and indirect injections hidden in code, tool outputs, and user-supplied content.

The model is built on DeBERTa v3 Small architecture: 44 million parameters, 6 layers, 768 hidden dimensions. Small enough to run inference on CPU in real time. Precise enough to outperform models an order of magnitude larger on prompt injection benchmarks.

We are releasing it on Hugging Face under CC BY-NC 4.0 for research and non-commercial use. The production version ships as part of Vigil Guard Enterprise with full preprocessing pipeline, ONNX optimization, and managed deployment.

VGE PromptGuard v1g: Open-Weight Prompt Injection Detection Model

Built for Agentic AI Systems

Most prompt injection detectors analyze user messages in isolation. That works when the user types directly into a chatbot. It falls apart when AI agents start calling external tools, processing API responses, and executing multi-step workflows.

VGE PromptGuard v1g was trained on code from the Masked Language Model phase onward. It understands source code structure, tool output formats, and function return patterns. The model ships with a 3-tier preprocessing pipeline that extracts payloads after APIResults, ToolOutput, and FunctionReturn markers, catching injections embedded deep inside agentic workflows.

On the LLM-PIEval benchmark (750 samples of agentic injection scenarios), VGE PromptGuard v1g achieves a 97.5% detection rate with the payload extraction pipeline.

Performance vs Size

Tested on Protect AI's own validation dataset (3,227 samples across 7 splits). The same test set, the same label mapping, independently reproducible. VGE PromptGuard v1g delivers 2x higher overall F1 while maintaining a false positive rate below 1% on production-representative benign traffic.

The FPR difference matters in practice. A model that flags 42.5% of benign prompts as injections is unusable in production. Every false alarm erodes trust and creates operational noise. At 0.0% FPR on the gold benchmark and under 2.2% on ToxicChat, VGE PromptGuard v1g operates without alert fatigue.

Metric	VGE PromptGuard v1g	Protect AI v2 (base)
Overall F1	0.934	0.452
Direct injection F1	0.981	—
INJ Recall (direct)	96.7%	—
FPR (benign prompts)	0.0%	42.5%
Agentic detection	97.5%	—
Parameters	44M	86M

Language and Code Coverage

Primary language support covers English and Polish, both trained on native corpora, not machine-translated. Polish business prompts, tool-use scenarios, and conversational exchanges are classified with near-zero over-defense (2.5% FPR on the Polish benchmark).

The model also handles code-context classification. Trained on source code during the MLM domain adaptation phase, it detects prompt injections hidden inside code snippets, function definitions, and inline comments. On the BIPIA code benchmark, VGE PromptGuard v1g achieves 100% recall (the base model scores 0%).

Extended coverage for 5+ additional languages is available through the VGE platform's content moderation layer.

How to Use It

The model is available at huggingface.co/VigilGuard/vigil-llm-guard. It supports PyTorch inference, ONNX Runtime for optimized CPU deployment, and the full preprocessing pipeline for production use with long prompts and agentic contexts.

For production deployments, VGE PromptGuard v1g runs inside Vigil Guard Enterprise with automatic scaling, SIEM integration, and managed policy enforcement. Install VGE with a single command and the model is deployed, configured, and monitored out of the box.