The RAIL AI Safety Index 2026: benchmarking 10 LLMs across 8 dimensions

We benchmarked 10 frontier LLMs using four safety evaluation frameworks: Phare V2, Cisco HarmBench, Gray Swan's multi-attempt red-teaming, and MLCommons AILuminate. Bias resistance is the weakest link, safety improvements are stagnating, and single-attempt metrics dramatically understate real-world risk.

Research · Apr 9, 2026 · 24 min read · Anand Thakur

Of the 55 frontier language models evaluated by the Phare V2 benchmark in February 2026, not one scored above 90% in average safety. Bias resistance -- the dimension most directly tied to real-world harm -- remains below 65% for the majority of models tested. Meanwhile, 78% of organizations now use AI in at least one business function, but only 33% have responsible AI controls in place.


Key Takeaways

  • Anthropic's Claude 4.5 models sweep the top three positions on the Phare V2 safety leaderboard, but even the best model (Claude 4.5 Haiku, 83.2%) leaves significant room for improvement.
  • Bias resistance is the weakest safety dimension across nearly every model tested. Most score below 65%, and DeepSeek R1 0528 scores just 25.5%.
  • Single-attempt safety metrics are misleading. Gray Swan's multi-attempt testing shows Claude Opus 4.5 jumping from 4.7% to 63% attack success rate across 100 attempts in coding mode.
  • DeepSeek R1 failed to block a single harmful prompt in Cisco's HarmBench testing (100% attack success rate).
  • Enterprise AI adoption (78%) far outpaces responsible AI governance (33% with controls), creating systemic risk.
  • LLM safety improvements are stagnating -- improved reasoning does not correlate with better safety.

Introduction

The AI safety evaluation landscape in 2026 is fragmented, inconsistent, and structurally incomplete. Organizations deploying frontier language models face a paradox: more safety benchmarks exist than ever before, yet none can tell you whether a model is truly safe for your use case.

Stanford's HELM Safety project found that of 102 safety benchmarks published since 2018, only 12 were actually used to evaluate state-of-the-art models as of March 2024. MLCommons, the consortium behind the most widely cited enterprise safety benchmark (AILuminate), explicitly warns that "performing well on the benchmark does not mean your model is safe -- simply that we have not identified critical safety weaknesses." The benchmarks themselves acknowledge they cannot do what enterprises most need them to do.

This article assembles data from four distinct safety evaluation frameworks -- Phare V2, Cisco HarmBench, Gray Swan's multi-attempt red-teaming, and MLCommons AILuminate -- to construct a composite picture of where 10 frontier LLMs stand across multiple safety dimensions in April 2026.

The Phare V2 benchmark: the most current multidimensional safety leaderboard

The Phare benchmark from Giskard, developed in partnership with Google DeepMind, is the most current multidimensional safety leaderboard available. Its V2 update (February 2026) evaluates 55 models across four dimensions: Hallucination Resistance, Harm Resistance, Bias Resistance, and Jailbreak Resistance.

The key finding from V2: "LLM security improvements are stagnating" -- improved reasoning does not correlate with better safety. Safety, the Phare team concluded, "requires dedicated investment and engineering" and is "not an inevitable byproduct of model development."

The Phare V2 leaderboard

Rank  Model              Avg Safety  Hallucination  Harm    Bias    Jailbreak
1     Claude 4.5 Haiku   83.2%       83.6%          99.9%   70.7%   78.5%
2     Claude 4.5 Opus    82.4%       88.2%          98.3%   63.2%   79.8%
3     Claude 4.5 Sonnet  77.6%       87.0%          99.1%   49.1%   75.2%
8     Gemini 3.0 Pro     73.3%       81.0%          93.5%   53.7%   65.1%
10    GPT 5.1            72.8%       81.8%          96.9%   46.8%   65.8%
11    GPT 5.2            71.0%       77.1%          96.9%   38.5%   71.6%
12    Llama 4 Maverick   70.8%       71.5%          89.3%   73.7%   49.0%
26    DeepSeek V3.1      64.8%       61.6%          94.4%   65.2%   38.2%
38    Mistral Large 3    60.9%       68.0%          88.1%   62.7%   24.9%
46    DeepSeek R1 0528   58.6%       72.9%          95.2%   25.5%   40.7%

Anthropic models sweep the top three positions. A striking pattern: bias resistance is the weakest dimension for nearly every model, with most scoring below 65%. Jailbreak resistance shows the greatest provider-to-provider variance -- Anthropic models cluster around 75--80%, while Mistral Large 3 sits at just 24.9%.
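The Avg Safety column tracks the unweighted mean of the four dimension scores (e.g. Claude 4.5 Haiku: (83.6 + 99.9 + 70.7 + 78.5) / 4 = 83.2). A minimal sketch of that aggregation in Python, using values from the leaderboard above (the aggregation method is inferred from the published numbers, not from Phare's documentation):

```python
# Per-dimension Phare V2 scores (percent), copied from the leaderboard above.
scores = {
    "Claude 4.5 Haiku": {"hallucination": 83.6, "harm": 99.9,
                         "bias": 70.7, "jailbreak": 78.5},
    "DeepSeek R1 0528": {"hallucination": 72.9, "harm": 95.2,
                         "bias": 25.5, "jailbreak": 40.7},
}

def avg_safety(dims):
    """Unweighted mean across the four Phare dimensions, to one decimal."""
    return round(sum(dims.values()) / len(dims), 1)

for model, dims in scores.items():
    print(model, avg_safety(dims))
# Reproduces the leaderboard averages: 83.2 for Haiku, 58.6 for R1 0528.
```

The same check holds for every row in the table, which is why a single composite number can hide a 25.5% bias score behind a 95.2% harm score.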

[Figure: Phare V2 safety scores across 4 dimensions]

Cisco HarmBench: single-shot adversarial testing

Cisco's HarmBench testing (January 2025, 50 prompts) provides a direct cross-model comparison of attack success rates:

Model              ASR
DeepSeek R1        100%
Llama 3.1-405B     96%
GPT-4o             86%
Gemini-1.5-Pro     64%
o1-preview         26%
Claude 3.5 Sonnet  26%

DeepSeek R1 failed to block a single harmful prompt. Separate testing by Enkrypt AI found R1 is 11x more likely to generate harmful content than OpenAI o1, with 83% of bias attacks and 78% of insecure-code attacks succeeding. Promptfoo testing gave DeepSeek R1 a 53.5% security pass rate, Llama 4 Scout just 21.7%, and Llama 4 Maverick 25.5%.
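Attack success rate (ASR) here is simply the fraction of adversarial prompts that elicited a harmful completion, so with Cisco's 50-prompt set, 100% means zero prompts were blocked. A minimal sketch (the boolean-list representation is an illustrative assumption, not Cisco's harness):

```python
def attack_success_rate(results):
    """Percentage of adversarial prompts that elicited a harmful completion.

    `results` is a list of booleans: True if the attack succeeded
    (the model complied), False if the model refused or blocked it.
    """
    return 100.0 * sum(results) / len(results)

# Illustrative reconstruction of the 50-prompt Cisco run:
deepseek_r1 = [True] * 50                  # nothing blocked
claude_35   = [True] * 13 + [False] * 37   # 13 of 50 succeeded
print(attack_success_rate(deepseek_r1))    # 100.0
print(attack_success_rate(claude_35))      # 26.0
```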

[Figure: Jailbreak attack success rate comparison]

Gray Swan: multi-attempt red-teaming reveals hidden risk

The Gray Swan benchmark reveals a critical flaw in standard safety evaluation: most benchmarks test a single prompt, once. In real-world adversarial scenarios, attackers try repeatedly. Anthropic's own 153-page system card reports results from RL-driven attack campaigns of up to 200 attempts.

Model                                               ASR (1 attempt)  ASR (10)  ASR (100)
Claude Opus 4.5 (coding)                            4.7%             33.6%     63.0%
Claude Opus 4.5 (computer use + extended thinking)  0%               0%        0% (at 200)
GPT-5.1                                             21.9%            --        --
Gemini 3 Pro                                        12.5%            --        --

Claude Opus 4.5 in computer-use mode with extended thinking became the first model to saturate the benchmark at 0% ASR even after 200 attempts. But in coding mode, the same model jumps from 4.7% to 63% across 100 attempts -- demonstrating that safety is not a fixed property of a model but a function of deployment configuration.
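One way to see why single-attempt numbers mislead: if each attempt were an independent draw with fixed per-attempt success probability p, cumulative ASR over n attempts would be 1 - (1 - p)^n. A short sketch comparing that naive model against Gray Swan's observed numbers (the independence assumption is mine, for illustration; real attackers adapt between attempts):

```python
def cumulative_asr(p_single, attempts):
    """Cumulative attack success probability assuming each attempt is an
    independent draw with a fixed per-attempt success probability."""
    return 1.0 - (1.0 - p_single) ** attempts

# Claude Opus 4.5 in coding mode: 4.7% observed at 1 attempt.
for n in (1, 10, 100):
    print(n, round(cumulative_asr(0.047, n), 3))
# Prints 0.047, 0.382, 0.992: the independence model predicts ~38% at 10
# attempts and ~99% at 100, versus the observed 33.6% and 63.0%.
```

That the observed curve flattens below the independence prediction suggests a subset of behaviors stays robustly defended across attempts; either way, the 4.7% headline figure says little about what 100 determined attempts achieve.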

[Figure: Multi-attempt ASR degradation]

MLCommons AILuminate: the human-generated standard

MLCommons AILuminate (v1.0 December 2024, v1.1 February 2025) uses a five-tier grading system (Poor to Excellent) across 12 hazard categories with 24,000+ human-generated prompts. Key findings:

  • No model received "Excellent." Claude 3.5 Sonnet and Mistral Large 2402 (moderated) received "Very Good." GPT-4o, Gemini 2.0 Flash, and Llama 3.1 405B received "Good."
  • The December 2025 jailbreak benchmark v0.5 introduced resilience-gap testing against template-, encoding-, and optimization-based jailbreaks.
  • MLCommons's official statement: "Performing well on the benchmark does not mean your model is safe -- simply that we have not identified critical safety weaknesses."

Why binary benchmarks fail

The data across all four frameworks points to six structural problems with binary safety evaluation:

  1. Negative predictive power only. Every major benchmark explicitly disclaims that passing does not equal safe. HELM Safety v1.0: "is not able to designate models as safe -- can only identify ways in which models may be unsafe."

  2. Coverage gaps. Of 102 safety benchmarks published since 2018, only 12 were used to evaluate SOTA models (Stanford HELM Safety). Multi-turn interactions, agentic behavior, multimodal inputs, non-Western languages, and deployment context remain uncovered.

  3. Consequence-blindness. Models detect lexical patterns, not actual consequence risk -- causing both jailbreak vulnerability and over-refusal from the same root cause.

  4. Fine-tuning nullifies alignment. FAR.AI demonstrated DeepSeek R1 guardrails are "illusory and easily removed" via jailbreak-tuning. This applies to all fine-tunable models.

  5. Stagnation. Phare V2 concluded safety "requires dedicated investment and engineering" and is "not an inevitable byproduct of model development."

  6. GCG-Transfer effect. HELM Safety found model scores are on average 25.9% worse when evaluated with automated red-teamed prompts rather than standard benchmark prompts.

Additional benchmarks and model-specific notes

HELM Safety (Stanford CRFM, v1.0 November 2024)

Tests 5 benchmarks spanning 6 risk categories across 24 prominent models. Uses BBQ, SimpleSafetyTests, HarmBench, XSTest, and AnthropicRedTeam. Statement: "HELM Safety v1.0 is not able to designate models as safe -- can only identify ways in which models may be unsafe."

Other evaluation frameworks

  • DecodingTrust: 8 perspectives (toxicity, stereotypes, privacy, machine ethics, fairness, adversarial robustness, OOD robustness, adversarial demonstrations)
  • SafetyBench (Tsinghua, ACL 2024): 11,435 MCQs across 7 safety categories, Chinese + English
  • JailbreakBench (NeurIPS 2024): 200 behaviors, standardized attack evaluation
  • HarmBench: Up to 510 behaviors across 4 functional and 7 semantic categories
  • ICLR 2025 adaptive attacks (Andriushchenko et al.): Nearly 100% ASR on GPT-3.5/4, Llama-2-Chat, Gemma-7B using logprob-based random search. All Claude models jailbroken via transfer or prefilling with 100% success.

Model-specific safety notes

  • Claude 4.5 Sonnet: 98.7% safety score; first model never to engage in blackmail in alignment testing; complied with harmful requests in under 5% of cases; false-positive refusal rates fell 10x.
  • GPT-5.1: "Safe completions" feature (helpful responses rather than outright refusals); deception rate 2.1% vs 4.8% for o3.
  • Llama 4 Maverick: Aggregate ASR of 49%; system prompt leak resistance only 36.56% blocked (Protect AI).
  • UK AISI challenge: 1.8 million attacks across 22 models -- every model broke, with ASR ranging 1.47--6.49%.

The enterprise governance gap

Enterprise data underscores why these benchmark results matter at organizational scale:

  • 78% of organizations now use AI in at least one function, up from 55% in 2023 (McKinsey "State of AI" 2025, 1,993 respondents across 105 nations).
  • 23% are scaling agentic AI; 39% experimenting (McKinsey).
  • Only 33% of companies have responsible AI controls despite 75% integrating AI (EY 2025).
  • Only 1 in 5 companies has mature governance for autonomous AI agents (Deloitte State of AI in Enterprise 2026, 3,235 leaders).
  • 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024.
  • $1.5 trillion worldwide AI spending in 2025 (Gartner).
  • 70--85% of AI initiatives fail to meet expected outcomes (MIT/RAND).

The gap between adoption velocity and governance maturity means the majority of organizations deploying frontier LLMs lack the infrastructure to detect or mitigate the specific dimensional failures identified in this analysis.

Why RAIL's 8-dimension framework addresses these gaps

RAIL's approach to safety evaluation -- scoring across Fairness, Safety, Reliability, Transparency, Privacy, Accountability, Inclusivity, and User Impact -- was designed specifically to address the limitations exposed in this analysis. Where Phare V2 evaluates four dimensions and HarmBench tests a single adversarial axis, the RAIL Score Evaluator provides an 8-dimension profile that maps to the actual risk categories enterprises face in deployment.

Organizations using the RAIL Score Evaluator can test their specific models against their specific use cases and receive per-dimension scores that directly inform risk assessment. This is the difference between "this model scored 7.2 out of 10" and "this model scores 8.9 on Safety but 4.1 on Fairness, which is critical for your hiring use case."
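A sketch of how such per-dimension scores could gate a deployment decision. The dimension names follow RAIL's framework, but the `risk_flags` helper, the score values, and the use-case thresholds are all hypothetical illustrations, not the RAIL Score Evaluator's actual API:

```python
# Hypothetical per-dimension profile (0-10 scale) and use-case gating.
RAIL_DIMENSIONS = ("Fairness", "Safety", "Reliability", "Transparency",
                   "Privacy", "Accountability", "Inclusivity", "User Impact")

def risk_flags(scores, use_case_thresholds):
    """Return the dimensions scoring below the floor set for this use case."""
    return {dim: scores[dim] for dim, floor in use_case_thresholds.items()
            if scores[dim] < floor}

model_scores = {dim: 8.0 for dim in RAIL_DIMENSIONS}
model_scores.update({"Safety": 8.9, "Fairness": 4.1})

# A hiring use case sets a much higher floor on Fairness than, say, coding.
hiring = {"Fairness": 7.0, "Privacy": 7.0, "Safety": 6.0}
print(risk_flags(model_scores, hiring))  # {'Fairness': 4.1}
```

The point of the sketch is the shape of the output: a composite 7-ish average would pass silently, while the per-dimension view surfaces the one score that disqualifies the model for this use case.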

Conclusion

The composite safety picture across 10 frontier LLMs reveals a field where harm prevention is strong, bias resistance is weak, jailbreak vulnerability varies dramatically by provider, and single-attempt metrics systematically understate real-world risk. No model achieves comprehensive safety across all dimensions, and improved reasoning capability does not translate into improved safety.

For organizations deploying these models, the practical takeaway is clear: safety evaluation must be multidimensional, deployment-specific, and ongoing. A single benchmark score cannot capture the dimensional complexity of LLM safety. Evaluating against multiple axes -- and understanding which dimensions matter most for a given use case -- is the minimum standard for responsible deployment in 2026.