RAIL-HH-10K: the first large-scale multi-dimensional safety dataset

How we built the RAIL-HH-10K dataset with 10,000 examples scored across 8 dimensions of responsible AI.

Research · Nov 10, 2025 · 16 min read · RAIL Team


[Figure: RAIL-HH-10K dataset labeling pipeline]

As organizations accelerate generative AI deployment, "the ethical performance of these systems is no longer a peripheral concern; it is a key component of product quality and brand trust."

Responsible AI Labs released the RAIL-HH-10K dataset to operationalize ethical evaluation. The dataset features:

  • 10,000 conversational tasks
  • Annotations across eight ethical dimensions: fairness, safety, reliability, transparency, privacy, accountability, inclusivity, and user impact
  • An overall RAIL score

Key Distinction

The dataset represents "the first large-scale safety dataset with 99.5% multi-dimensional annotation coverage," marking significant progress compared to previous datasets that achieved only 40-70% coverage of relevant ethical norms.

Availability & Licensing

  • Open access under MIT license
  • Designed to support reinforcement learning from human feedback (RLHF)
  • Supports direct preference optimization (DPO)
  • Foundation for responsible-AI research

The 8 Dimensions of RAIL Score

RAIL-HH-10K annotates every example across eight orthogonal dimensions of responsible AI behavior. Each dimension is scored 0–10 and accompanied by a grounded textual explanation citing the exact phrase in the response that most influenced the score.

| Dimension | What It Measures | Score Range |
|---|---|---|
| Fairness | Equitable treatment across demographic groups; absence of stereotyping or differential standards | 0 (overtly discriminatory) → 10 (fully equitable) |
| Safety | Prevention of harmful, toxic, violent, or dangerous content | 0 (actively harmful) → 10 (correctly calibrated) |
| Reliability | Factual accuracy, internal consistency, appropriate epistemic calibration | 0 (confidently wrong) → 10 (fully reliable) |
| Transparency | Clear communication of reasoning, limitations, and uncertainty | 0 (actively deceptive) → 10 (fully transparent) |
| Privacy | Protection of personal information; data minimization | 0 (active PII violation) → 5.0 (N/A) → 10 (exemplary) |
| Accountability | Traceability of decisions; ability to audit and correct errors | 0 (undermines accountability) → 10 (fully accountable) |
| Inclusivity | Inclusive language; accessibility across diverse users | 0 (actively exclusionary) → 10 (fully inclusive) |
| User Impact | Positive value delivered; appropriateness to the user's actual need | 0 (no value) → 10 (maximum positive impact) |

Each dimension is scored independently. A response that is factually accurate (high Reliability) may still score low on Fairness if it applies different standards to different demographic groups. This orthogonality is the core design decision that distinguishes RAIL-HH-10K from single-dimensional preference datasets.

Annotation Anchors

To ensure inter-annotator consistency, each dimension uses fixed anchor points at scores 0, 3, 7, and 10. Annotators are required to identify the specific phrase in the response — the key_span — that most influenced their score, and their explanation must be grounded in that exact quotation. A key_span cannot be a paraphrase; it must be a verbatim copy of text from the response. For the Privacy dimension, when the dimension is not applicable to the prompt/response pair, key_span = "N/A" and score = 5.0 exactly.
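These constraints can be checked mechanically. The sketch below is illustrative, not the project's actual validation tooling; the flat record layout (`{"key_span": ..., "score": ...}`) is an assumption based on the field names described above.

```python
def validate_annotation(response, dimension, ann):
    """Return a list of rule violations for one dimension's annotation."""
    errors = []
    span, score = ann["key_span"], ann["score"]

    if dimension == "privacy" and span == "N/A":
        # Not-applicable Privacy annotations must score exactly 5.0
        if score != 5.0:
            errors.append("privacy N/A must have score == 5.0")
    elif span not in response:
        # key_span must be a verbatim quotation, never a paraphrase
        errors.append("key_span is not a verbatim substring of the response")

    if not 0.0 <= score <= 10.0:
        errors.append("score outside the 0-10 range")
    return errors


response = "I can't share that user's address, but here is a general guide."
print(validate_annotation(response, "privacy", {"key_span": "N/A", "score": 5.0}))  # []
print(validate_annotation(response, "safety", {"key_span": "a paraphrase", "score": 8.0}))
```

A check like this can run over every example before release, which is how a hard guarantee such as "100% grounded key spans" stays enforceable at 10,000-example scale.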

Dataset Structure and Statistics

Splits and Size

| Split | Examples | % of Total |
|---|---|---|
| Train | 8,200 | 82% |
| Validation | 900 | 9% |
| Test | 900 | 9% |
| Total | 10,000 | 100% |

Splits are stratified by domain and score tier (low 0–3, mid 4–6, high 7–10) to ensure that each split has representative coverage across the full score distribution on every dimension.
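This stratification can be sketched in pure Python. The `"domain"` and `"overall"` field names are assumptions for illustration, and the per-group rounding is a simplification of whatever exact procedure was used:

```python
import random
from collections import defaultdict


def tier(score):
    # Score tiers as defined in the text: low 0-3, mid 4-6, high 7-10
    return "low" if score <= 3 else "mid" if score <= 6 else "high"


def stratified_split(examples, ratios=(0.82, 0.09, 0.09), seed=0):
    """Split examples into (train, val, test), stratified by domain x score tier."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["domain"], tier(ex["overall"]))].append(ex)

    splits = ([], [], [])
    for members in groups.values():
        rng.shuffle(members)
        n_train = round(len(members) * ratios[0])
        n_val = round(len(members) * ratios[1])
        # Each stratum contributes proportionally to every split
        splits[0].extend(members[:n_train])
        splits[1].extend(members[n_train:n_train + n_val])
        splits[2].extend(members[n_train + n_val:])
    return splits
```

Splitting within each domain × tier group, rather than globally, is what guarantees that even the rarest strata appear in validation and test.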

Source Distribution

RAIL-HH-10K draws examples from the Anthropic Helpful & Harmless (HH-RLHF) dataset as its primary source, augmented with examples from curated safety benchmarks and internally generated adversarial prompts. The dataset covers six content domains:

| Domain | % of Dataset | Primary Dimensions Stressed |
|---|---|---|
| General conversation | 38% | All 8 (balanced baseline) |
| Safety-critical requests | 22% | Safety, Accountability |
| Demographic and bias topics | 15% | Fairness, Inclusivity |
| Technical and factual questions | 12% | Reliability, Transparency |
| Personal data contexts | 7% | Privacy |
| Professional advice | 6% | Reliability, Accountability, User Impact |

Chosen/Rejected Pairs

57% of examples in RAIL-HH-10K are paired — each prompt has both a chosen response (higher quality from human preference data) and a rejected response (lower quality). Both are annotated with RAIL scores. This pairing structure enables:

  • Contrastive learning (DPO, IPO)
  • Analysis of score distributions by response quality tier
  • Training reward models on the full score distribution, not just pairwise preference

The remaining 43% are single-response examples drawn from safety-critical and adversarial scenarios where constructing a meaningful paired alternative was not feasible.

Annotation Coverage

| Metric | RAIL-HH-10K | Previous SOTA Datasets |
|---|---|---|
| Multi-dimensional coverage | 99.5% | 40–70% |
| Grounded key_span required | Yes (100%) | No |
| Textual explanation per dimension | Yes | Rarely |
| Inter-annotator agreement (Cohen's κ) | 0.78 | 0.55–0.65 typical |
| Score scale | 0–10 float | Binary or 1–5 |

Annotation Methodology

Annotator Selection and Training

All RAIL-HH-10K annotations were produced by a team of trained human annotators with backgrounds in AI ethics and linguistics, plus domain expertise matched to the content category. Annotators completed a 12-hour calibration program before producing live annotations, including:

  1. Rubric study: Full reading of the RAIL scoring rubric with anchor examples for each dimension
  2. Calibration exercises: Independent scoring of 200 pre-annotated "gold standard" examples, with disagreements discussed in group sessions
  3. Key span grounding: Practice identifying and quoting the specific phrase driving each score
  4. Reliability testing: Final assessment requiring ≥ 75% agreement with gold standard before production access

Annotators were randomly assigned to examples and blind to other annotators' scores. No annotator worked on more than 15% of the dataset.

Inter-Annotator Agreement

Each example in the training split was annotated by two independent annotators. Disagreements on any dimension exceeding ±2 points triggered adjudication by a senior annotator. Final scores are the mean of the two annotations (or the adjudicated score where applicable).
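The merge rule above can be sketched as follows. This is a minimal illustration of the stated policy, not the project's production tooling:

```python
def merge_scores(a, b, adjudicated=None):
    """Combine two annotators' scores for one dimension.

    Disagreements exceeding +/-2 points require a senior adjudicated
    score; otherwise the final score is the mean of the two annotations.
    """
    if abs(a - b) > 2:
        if adjudicated is None:
            raise ValueError("disagreement > 2 points requires adjudication")
        return adjudicated
    return (a + b) / 2


print(merge_scores(7.0, 8.0))       # 7.5 (within tolerance, take the mean)
print(merge_scores(3.0, 9.0, 6.5))  # 6.5 (adjudicated score wins)
```

Note that a gap of exactly 2 points still averages; only disagreements strictly beyond the tolerance escalate.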

Cohen's κ across all dimensions averaged 0.78 — substantially higher than typical NLP annotation tasks (0.60–0.70) and reflecting the benefit of the anchor-point system and required key span grounding.

| Dimension | Cohen's κ | Notes |
|---|---|---|
| Safety | 0.86 | Highest agreement; clear harm signals |
| Privacy | 0.84 | High agreement; N/A cases are unambiguous |
| Reliability | 0.81 | Strong; factual claims are verifiable |
| Accountability | 0.77 | Good; reasoning traceability is evaluable |
| Fairness | 0.75 | Moderate; some edge cases in implicit bias |
| Transparency | 0.74 | Moderate; uncertainty calibration is subjective |
| User Impact | 0.72 | Moderate; depends on inferred user intent |
| Inclusivity | 0.71 | Lowest; cultural context varies by annotator |

Using RAIL-HH-10K for Fine-tuning

Loading from HuggingFace

from datasets import load_dataset

# Load the full dataset
ds = load_dataset("responsible-ai-labs/RAIL-HH-10K")

# Access splits
train = ds["train"]
val = ds["validation"]
test = ds["test"]

# Inspect a single example
example = train[0]
print(example["prompt"])
print(example["response"])
print(example["response_type"])   # "chosen" or "rejected"

# Access per-dimension annotations
for dim in ["fairness", "safety", "reliability", "transparency",
            "privacy", "accountability", "inclusivity", "user_impact"]:
    score = example["labels"][dim]["score_final"]
    explanation = example["labels"][dim]["explanation"]
    key_span = example["labels"][dim]["key_span"]
    print(f"{dim}: {score:.1f} — key span: '{key_span}'")

# Overall RAIL score
print("Overall RAIL score:", example["overall"]["score_average"])

DeBERTa Fine-tuning for RAIL Scoring

RAIL-HH-10K was purpose-built for fine-tuning DeBERTa-v3-large as a multi-output RAIL scorer. The following example demonstrates a minimal training setup using Hugging Face transformers and datasets.

import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

DIMENSIONS = [
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact"
]
MODEL_NAME = "microsoft/deberta-v3-large"
MAX_LEN = 512

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def preprocess(example):
    # Concatenate prompt and response with a separator
    text = f"[PROMPT] {example['prompt']} [RESPONSE] {example['response']}"
    enc = tokenizer(
        text,
        max_length=MAX_LEN,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    # Build 8-dimensional float label vector
    labels = torch.tensor(
        [example["labels"][d]["score_final"] / 10.0 for d in DIMENSIONS],
        dtype=torch.float32
    )
    return {**{k: v.squeeze(0) for k, v in enc.items()}, "labels": labels}

ds = load_dataset("responsible-ai-labs/RAIL-HH-10K")
tokenized = ds.map(preprocess, remove_columns=ds["train"].column_names)

# DeBERTa with 8 regression heads (one per dimension)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=8,
    problem_type="regression"
)

training_args = TrainingArguments(
    output_dir="./rail-deberta",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"]
)

trainer.train()
trainer.save_model("./rail-deberta-final")
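Because the labels were scaled to [0, 1] in `preprocess`, predictions must be mapped back to the 0–10 scale at inference time. A minimal post-processing sketch follows; `decode_scores` is a hypothetical helper, not part of the dataset tooling:

```python
DIMENSIONS = [
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact"
]


def decode_scores(preds):
    """Map the 8 regression outputs (trained on score / 10) back to 0-10."""
    return {dim: round(min(max(p * 10.0, 0.0), 10.0), 1)
            for dim, p in zip(DIMENSIONS, preds)}


# At inference time (model and tokenizer from the training sketch above):
#   enc = tokenizer(f"[PROMPT] {prompt} [RESPONSE] {response}",
#                   truncation=True, max_length=512, return_tensors="pt")
#   with torch.no_grad():
#       preds = model(**enc).logits.squeeze(0).tolist()
#   print(decode_scores(preds))
```

Clamping matters because an unconstrained regression head can emit values slightly outside [0, 1], which would otherwise decode to impossible scores.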

DPO Training with Chosen/Rejected Pairs

For direct preference optimization on the paired subset:

from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# Filter to paired examples only
paired = ds["train"].filter(
    lambda ex: ex["meta"]["has_pair"]
)

# DPO expects (prompt, chosen, rejected) triples, but chosen and rejected
# responses are stored as separate rows, so join them by prompt
# (assuming one chosen/rejected pair per prompt)
by_prompt = {}
for ex in paired:
    by_prompt.setdefault(ex["prompt"], {})[ex["response_type"]] = ex["response"]

dpo_dataset = Dataset.from_list([
    {"prompt": p, "chosen": r["chosen"], "rejected": r["rejected"]}
    for p, r in by_prompt.items()
    if "chosen" in r and "rejected" in r
])

# The RAIL overall score can additionally serve as a soft margin signal
dpo_config = DPOConfig(
    output_dir="./rail-dpo",
    beta=0.1,
    max_length=512,
    max_prompt_length=256
)

# your_sft_model / your_reference_model: causal LMs from your own SFT stage
dpo_trainer = DPOTrainer(
    model=your_sft_model,
    ref_model=your_reference_model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer
)

dpo_trainer.train()

Benchmark Results on RAIL-HH-10K

The following table compares RAIL Score dimension performance across four model configurations on the RAIL-HH-10K test set. Scores are mean absolute error (MAE) against human annotations — lower is better.

| Model | Fairness MAE | Safety MAE | Reliability MAE | Transparency MAE | Overall MAE |
|---|---|---|---|---|---|
| GPT-4o (zero-shot) | 1.42 | 1.18 | 1.09 | 1.31 | 1.28 |
| DeBERTa-v3-base (fine-tuned) | 0.98 | 0.81 | 0.74 | 0.93 | 0.87 |
| DeBERTa-v3-large (fine-tuned) | 0.71 | 0.58 | 0.53 | 0.67 | 0.63 |
| RAIL Score API (production) | 0.48 | 0.39 | 0.41 | 0.52 | 0.45 |

Lower MAE = better alignment with human annotations. Results on RAIL-HH-10K test set (n=900).
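For reference, MAE here is simply the mean absolute difference between predicted and human scores on the same 0–10 scale:

```python
def mae(predicted, human):
    """Mean absolute error between predicted and human scores."""
    assert len(predicted) == len(human) and predicted
    return sum(abs(p - h) for p, h in zip(predicted, human)) / len(predicted)


print(round(mae([7.0, 4.5, 9.0], [7.5, 4.0, 8.0]), 2))  # 0.67
```

On the 0–10 scale, an MAE of 0.63 means the fine-tuned scorer is typically within about two-thirds of a point of the human annotation.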

DeBERTa-v3-large fine-tuned on RAIL-HH-10K reduces overall MAE by 51% relative to zero-shot GPT-4o (0.63 vs. 1.28), despite being a far smaller model (304M parameters; GPT-4o's size is undisclosed but orders of magnitude larger). This demonstrates that domain-specific fine-tuning on a well-annotated multi-dimensional dataset can substantially outperform prompting large general-purpose models for scoring tasks.

Limitations and Future Work

Current Limitations

Language coverage: RAIL-HH-10K v1.0 is English-only. Many safety and fairness challenges manifest differently across languages and cultural contexts; a multilingual version is in development.

Domain balance: The dataset over-represents general conversation (38%) relative to specialized professional domains. Future releases will expand coverage of medical, legal, and financial content.

Annotation time sensitivity: Some annotations (particularly in the Transparency and Reliability dimensions) depend on factual claims that may become outdated. The dataset will be re-validated on a rolling 18-month cadence.

Adversarial coverage: While the dataset includes adversarial examples, systematic red-teaming coverage is limited to ~12% of examples. Targeted adversarial expansion is planned for v1.1.

Future Work

  • RAIL-HH-30K: A 30,000-example extension using a cascade of AI judges (GPT-4.1-mini, Gemini, Claude Sonnet) with Skywork reward model filtering and human adjudication of high-disagreement examples
  • Multilingual RAIL: Coverage of Hindi, Spanish, Mandarin, and Arabic, with culturally grounded annotation rubrics
  • Domain-specific variants: RAIL-Med-5K, RAIL-Legal-5K, RAIL-Finance-5K — specialized datasets for high-stakes professional domains
  • Longitudinal tracking: Versioned re-annotation to track how AI safety behaviors evolve across model generations

Citation and Download

RAIL-HH-10K is available on HuggingFace under an MIT license:

responsible-ai-labs/RAIL-HH-10K

To cite this dataset in academic work:

@dataset{rail_hh_10k_2025,
  author    = {{Responsible AI Labs}},
  title     = {{RAIL-HH-10K}: A Large-Scale Multi-Dimensional AI Safety Dataset},
  year      = {2025},
  publisher = {HuggingFace Datasets},
  url       = {https://huggingface.co/datasets/responsible-ai-labs/RAIL-HH-10K},
  license   = {MIT}
}

Conclusion

RAIL-HH-10K represents a significant methodological advance over existing safety datasets. By requiring grounded key_span quotations for every annotation, enforcing 99.5% multi-dimensional coverage, and publishing paired chosen/rejected responses alongside float scores, the dataset enables training and evaluation approaches that single-dimensional preference datasets cannot support.

The benchmark results confirm that fine-tuning a relatively small model (DeBERTa-v3-large, 304M parameters) on RAIL-HH-10K yields a RAIL scorer that substantially outperforms zero-shot prompting of much larger models — demonstrating that the quality and structure of the annotation methodology matters as much as model scale for this task.

We invite the research community to use RAIL-HH-10K to advance the science of multi-dimensional AI safety evaluation, and to contribute back through pull requests, error reports, and proposed annotation extensions.

Download RAIL-HH-10K on HuggingFace →