Published June 15, 2026 · by Tomasz Bartel

Vigil Guard 1.8.x across nine public prompt-injection benchmarks

LLM Security · Prompt Injection · Benchmarks · RAG Security · AIDR

What we tested

We are wrapping up testing of 1.8.x. Before it goes to production, here is how it does on public benchmarks, because those are the results anyone outside the team can reproduce on their own.

We ran 1.8 against nine public, external prompt-injection benchmarks. We used no internal Vigil Guard dataset. Every result is raw 1.8 detection, and each one links to its source. All measurements use the single-turn POST /v1/guard/input contract with a block threshold of score ≥ 40, on build 1.8.0-lab, June 14, 2026.
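To make the measurement setup concrete, here is a minimal sketch of how a single sample could be scored against this contract. Only the path POST /v1/guard/input and the score ≥ 40 block threshold come from the setup described above. The base URL, the `text` request field and the `score` response field are assumptions made for illustration and may not match the real API schema.

```python
import json
import urllib.request

# From the setup above: a sample counts as blocked when score >= 40.
BLOCK_THRESHOLD = 40

# Hypothetical base URL; point this at your own deployment.
BASE_URL = "http://localhost:8080"


def build_request(text: str) -> urllib.request.Request:
    """Build one single-turn request. The JSON body shape is an assumption."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/guard/input",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_blocked(score: float, threshold: float = BLOCK_THRESHOLD) -> bool:
    """Apply the block rule: blocked when the score reaches the threshold."""
    return score >= threshold


def guard(text: str) -> bool:
    """Send one prompt and return True if it would be blocked.

    Assumes the response is JSON with a numeric 'score' field (hypothetical).
    """
    with urllib.request.urlopen(build_request(text), timeout=10) as resp:
        payload = json.load(resp)
    return is_blocked(payload["score"])
```

Keeping the threshold check in its own function means the same rule can be replayed over stored scores without calling the service again.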

The sets split into four scenarios: indirect injection hidden in content (emails, tables, code, RAG documents), classic direct injection against system instructions, jailbreaks paired with benign prompts written close to the attack boundary, and sets that measure over-defense, meaning how often the detector blocks legitimate text.

Indirect injection and RAG

1.8 is strongest on indirect attacks, the ones where the instruction is never typed by the user but sits in the content the model processes.

Microsoft's BIPIA is 17,614 indirect attacks planted in emails, tables and code, the largest set in this report. 1.8 detects 98.9% of them.

indirect-pia-detection (Chen et al., ACL 2025) is a peer-reviewed benchmark where the attack hides inside a long RAG document built from SQuAD and TriviaQA, and the detector has to pull it out of genuine, on-topic text. Overall recall is 99.2%; on the SQuAD split it is 99.7% (896 of 899) at 0.6% false positives. That is exactly the setup enterprise RAG pipelines run on: catch the planted instruction, leave the genuine document alone.

On direct attacks (instruction override, system-prompt extraction, behaviour redirection) pangea-aiguard-lab gives 95.3% recall over 1,433 samples. On JailbreakBench recall is 100% (124 of 124).

Benchmark	Type	N	1.8 result
BIPIA (Microsoft)	indirect: email / table / code	17,614	98.9% recall
indirect-pia-detection (ACL 2025)	indirect / RAG	3,394	99.2% recall
pangea-aiguard-lab	direct + benign	1,433	95.3% recall
JailbreakBench	jailbreak	124	100% recall
deepset/prompt-injections	direct (benign)	56	0.0% FP
NotInject	over-defense (benign)	339	2.6% over-block
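The percentages in this table are plain ratios of counts. The sketch below shows the arithmetic for the three rows where the text or the table gives both numerator and denominator. The helper name `rate` is ours, chosen for illustration.

```python
def rate(hits: int, total: int) -> float:
    """Return hits / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


# Recall = detected attacks / all attacks.
squad_recall = rate(896, 899)        # indirect-pia, SQuAD split
jailbreak_recall = rate(124, 124)    # JailbreakBench

# False-positive rate = blocked benign prompts / all benign prompts.
deepset_fp = rate(0, 56)             # deepset/prompt-injections

print(f"SQuAD split recall:    {squad_recall:.1f}%")
print(f"JailbreakBench recall: {jailbreak_recall:.1f}%")
print(f"deepset false positives: {deepset_fp:.1f}%")
```

Recall is computed only over attack samples and the false-positive rate only over benign samples, so the two numbers never share a denominator.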

Clean on benign

The other half of the job is not blocking legitimate traffic.

On deepset/prompt-injections 1.8 has 0.0% false positives (0 of 56 benign prompts), so ordinary prompts pass untouched. On the SQuAD split of indirect-pia, false positives are 0.6%.

NotInject is built specifically for over-defense: every prompt is benign but seeded with the words naive detectors latch onto. Here 1.8 has 2.6% over-block, with zero on Common Queries and zero on multilingual prompts. The rate climbs only with trigger-word saturation: 0.9% at one trigger word, 1.8% at two, 5.3% at three.

Against other guards

These numbers mean the most when set next to what someone would deploy instead of 1.8. We compare like-for-like only: same dataset, same kind of metric.

On NotInject (where accuracy is 1 − FPR, since every prompt is benign) 1.8 scores 97.4% and ranks second of nine systems, behind LlamaGuard-3 (99.71%) and ahead of Lakera Guard (87.61%), GPT-4o (86.73%) and ProtectAI v2 (56.64%). There is one caveat: LlamaGuard-3 buys that cleanliness with around 39% recall on indirect attacks, so it lets most real attacks through.
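As a quick check of the conversion used here, the sketch below turns the 2.6% over-block rate into NotInject accuracy and ranks it against the systems named above. Only five of the nine systems are named in this post, so the ranking covers those five. The dictionary layout is ours, for illustration.

```python
# On an all-benign set, accuracy = 100% - over-block rate (i.e. 1 - FPR).
over_block_pct = 2.6

# Accuracy figures (%) as quoted in the text; only five of nine systems are named.
notinject_accuracy = {
    "Vigil Guard 1.8": round(100.0 - over_block_pct, 1),
    "LlamaGuard-3": 99.71,
    "Lakera Guard": 87.61,
    "GPT-4o": 86.73,
    "ProtectAI v2": 56.64,
}

# Sort systems from highest to lowest accuracy.
ranked = sorted(notinject_accuracy, key=notinject_accuracy.get, reverse=True)
vigil_rank = ranked.index("Vigil Guard 1.8") + 1

for position, name in enumerate(ranked, start=1):
    print(f"{position}. {name}: {notinject_accuracy[name]}%")
```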

On indirect-pia 1.8 reaches 99.2% recall without being trained on the set, on par with detectors that were trained on it (97 to 99%), and well above off-the-shelf guards: Meta's Prompt-Guard (39.5 to 86%) and LlamaGuard-3 (≤ 39.1%).

The pattern is clear: most systems in these tables sit on one side of a trade-off, sensitive but over-defensive, or clean but weak on indirect attacks. 1.8 holds both ends at once.

What 1.8 does not catch yet

Two scenarios are clearly weaker for us. On black-box indirect injection (Amazon's llm-pieval) 1.8 catches 37.6%, and on web-agent injection (WAInjectBench) 29.6%. In both, the payload is crafted with no knowledge of the detector, and that is the hardest, newest attack class. It is where the next iteration goes.

One false-positive point stays open: clean TriviaQA documents in indirect-pia push the FPR to 30.5%. That sits inside the over-defense range reported for other detectors on the same set, but we want it lower.

When

We will ship 1.8.x soon, together with the full benchmark report. Every number in it reproduces from its linked public source: BIPIA, indirect-pia-detection, JailbreakBench, pangea-aiguard-lab, deepset, NotInject, llm-pieval, WAInjectBench.