AI Blog

by Michele Laurelli

CLaRa: Apple's Framework that Revolutionizes Retrieval-Augmented Generation

apple · rag · clara

"Continuous Latent Reasoning: 16x-128x Semantic Compression and End-to-End Optimization for Next-Generation RAG Systems"

11 min read

Apple has released CLaRa (Continuous Latent Reasoning), an open-source framework that represents a paradigm shift in Retrieval-Augmented Generation systems. Developed in collaboration with the University of Edinburgh, CLaRa addresses two fundamental problems of traditional RAG systems: the management of long contexts and the disconnection between retrieval and generation optimization.

The main innovation of CLaRa lies in its ability to compress documents into continuous latent representations that serve simultaneously for retrieval and generation, eliminating computational redundancy and allowing for end-to-end optimization through a single language modeling objective. Experimental results demonstrate that CLaRa achieves compression rates from 16x to 128x while maintaining, and often surpassing, the performance of full-text based RAG systems.

The Problem of Traditional RAG Systems

Context and Motivation

Retrieval-Augmented Generation (RAG) has become a fundamental paradigm for enhancing Large Language Models (LLMs) with external knowledge. By integrating evidence from document databases, RAG systems mitigate critical issues such as hallucinations and knowledge obsolescence. However, the traditional architecture of RAG systems presents significant structural limitations that compromise their efficiency and effectiveness.

The Two Fundamental Challenges

Challenge 1: Retrieval-Generation Misalignment. In conventional RAG systems, the retriever selects documents based on superficial similarity (typically cosine similarity in the embedding space), while the generator produces responses without providing feedback on which information is actually needed. This disjoint optimization creates a fundamental "broken gradient problem": since document selection is discrete, gradients from the generator cannot flow back to the retriever.

Challenge 2: Computational Inefficiency. Dense retrievers operate in the embedding space, while generators still consume raw text. This architectural mismatch produces:

  • Inconsistent representation spaces that hinder end-to-end optimization

  • Redundant processing of text that increases inference costs and causes context overflow

  • Double encoding for both retrieval and generation

The Key Insight of CLaRa

CLaRa proposes an elegant solution: use shared continuous representations for both retrieval and generation. Instead of maintaining separate embeddings and raw text, documents are encoded once into compact "memory token" representations that serve both purposes. This unification addresses both challenges at once: continuous representations make the retrieval process differentiable, while joint training aligns both modules in a shared semantic space optimized for reasoning.
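
As a rough illustration of this idea, the sketch below shows how a single memory-token matrix could serve both purposes: pooled into a cosine-similarity retrieval score, and prepended as soft prompts to the generator's input embeddings. The names and the pooling choice are illustrative assumptions, not CLaRa's actual API.

import torch
import torch.nn.functional as F

def retrieval_score(query_mem: torch.Tensor, doc_mem: torch.Tensor) -> torch.Tensor:
    # Pool the memory tokens (shape: num_tokens x hidden) and compare
    # query and document in the same latent space
    return F.cosine_similarity(query_mem.mean(dim=0), doc_mem.mean(dim=0), dim=-1)

def generator_inputs(selected_doc_mems: list, question_embeds: torch.Tensor) -> torch.Tensor:
    # The very same memory tokens are concatenated in front of the question
    # embeddings and consumed directly by the generator (no raw document text)
    return torch.cat([*selected_doc_mems, question_embeds], dim=0)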

CLaRa Architecture

Framework Overview

CLaRa is structured around a three-stage training process, each designed to progressively build the system's capabilities. The architecture is based on a shared LLM (Mistral-7B or Phi-4B in the experiments) equipped with multiple LoRA adapters for modular control.

| Stage | Name | Objective |
| --- | --- | --- |
| Stage 1 | SCP - Salient Compressor Pretraining | Pretraining of the compressor with QA supervision and paraphrasing |
| Stage 2 | Compression Instruction Tuning | Fine-tuning for downstream QA tasks |
| Stage 3 | End-to-End Training (CLaRa) | Joint training of reranker and generator |

Stage 1: Salient Compressor Pretraining (SCP)

The first stage addresses a critical limitation of existing methods: token-level reconstruction-based approaches tend to waste capacity on superficial patterns rather than preserving high-level semantics. SCP introduces a data synthesis framework that explicitly highlights salient information through QA and paraphrasing.

Data Synthesis Pipeline

Using 2 million documents from Wikipedia-2021, a local LLM (Qwen-32B) generates three complementary forms of supervision (a sketch of the prompting loop follows the list):

  • Simple QA: Question-answer pairs that capture individual atomic facts, encouraging fine-grained factual retention. The model is guided to extract distinct facts not covered by previous questions.

  • Complex QA: Pairs that integrate multiple facts, promoting relational reasoning and high-level abstraction. The model connects previously unlinked facts.

  • Paraphrasing: Paraphrased documents that reorder sentence structure, altering the superficial form while preserving core semantics. Learning this mapping through an informational bottleneck forces the representations to focus on semantics.
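
A minimal sketch of such a synthesis loop is shown below. The llm_generate helper and the prompt wording are illustrative assumptions, not the paper's exact templates.

import json

# Illustrative prompt templates for the three forms of supervision
PROMPTS = {
    "simple_qa": "Write question-answer pairs, each covering one distinct atomic fact:\n{doc}",
    "complex_qa": "Write question-answer pairs that each combine several facts from the text:\n{doc}",
    "paraphrase": "Rewrite the text with a different sentence structure but the same meaning:\n{doc}",
}

def synthesize_supervision(doc: str, llm_generate) -> dict:
    # Produce the three forms of supervision for one Wikipedia document,
    # using a local LLM (e.g. Qwen-32B) behind the llm_generate callable
    return {kind: llm_generate(template.format(doc=doc)) for kind, template in PROMPTS.items()}

# Example: append one record per document to a JSONL training file
# with open("scp_supervision.jsonl", "a") as f:
#     f.write(json.dumps(synthesize_supervision(doc, llm_generate)) + "\n")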

Compressor Architecture

Given a document d_i = {t₁, ..., t_m}, l learnable memory tokens (m₁, ..., m_l) are added. The hidden states of the last layer of the memory tokens form the compressed representation M_i.

The total loss function combines two components:

  1. Cross-Entropy Loss (L_CE): Supervises the generation of QA/paraphrases from the compressed representations.

  2. MSE Loss (L_MSE): Aligns the average hidden states of the document and memory tokens, ensuring that the compressed representations faithfully reflect the original semantics.

L_total = L_CE + λ·L_MSE
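
A minimal PyTorch sketch of this combined objective is given below; tensor names and the lambda weighting are illustrative, not values taken from the CLaRa code.

import torch.nn.functional as F

def scp_loss(qa_logits, qa_labels, doc_hidden, mem_hidden, lam=1.0):
    # Cross-entropy over the supervision targets (QA answers / paraphrases)
    # generated from the compressed representation
    l_ce = F.cross_entropy(
        qa_logits.view(-1, qa_logits.size(-1)),
        qa_labels.view(-1),
        ignore_index=-100,
    )
    # Align the average last-layer hidden states of the document tokens
    # and the memory tokens
    l_mse = F.mse_loss(mem_hidden.mean(dim=1), doc_hidden.mean(dim=1))
    return l_ce + lam * l_mse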

Stage 2: Compression Instruction Tuning

The pre-trained compressor is general-purpose. To adapt it for downstream QA and obtain a response generator that understands the compressed representations, instruction tuning is performed using downstream training datasets where retrieved documents are paired with task instructions. The target output is generated by a teacher model conditioned on the same documents and instructions.

Stage 3: End-to-End Training

The Broken Gradient Problem

The joint training of retrieval and generation requires addressing a fundamental problem: the top-k document selection is a discrete operation that interrupts the flow of gradients. CLaRa resolves this through a Differentiable Top-k Selector implemented with a Straight-Through (ST) Estimator.

Straight-Through Estimator for Top-k Selection

The mechanism works like a "soft lens": it preserves the discrete retrieval behavior during inference while allowing smooth gradient feedback during training. Given the cosine similarities s = [s₁, ..., s_D]:

  1. Forward Pass: Hard top-k selection (Z_hard) - standard discrete behavior

  2. Backward Pass: Softmax distribution (Z_soft) allows gradients to flow through the retriever

  3. Combination: Z = Z_hard + (Z_soft - SG(Z_soft)), where SG is the stop-gradient operator
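
In code, the trick amounts to a few lines; the sketch below uses illustrative variable names rather than those of the CLaRa repository.

import torch

def st_topk(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    # Forward pass: hard 0/1 mask over the k highest-scoring documents
    topk_indices = scores.topk(k, dim=-1).indices
    z_hard = torch.zeros_like(scores).scatter_(-1, topk_indices, 1.0)
    # Backward pass: softmax distribution that carries the gradients
    z_soft = torch.softmax(scores / tau, dim=-1)
    # Forward value equals z_hard; gradients flow only through z_soft
    return z_hard + (z_soft - z_soft.detach())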

Query Reasoner

A critical component of CLaRa is the Query Reasoner (θ_qr), a LoRA adapter initialized from the compressor that represents queries in the same space, and with the same number of memory tokens, as the document representations. Through Next-Token Prediction (NTP) training, the query reasoner learns not only to encode the intent of the query but also to anticipate relevant document content.

By analyzing the tokens decoded from the query representations through a logit lens, researchers found that the query reasoner incorporates reasoning-relevant keywords that do not appear in the original query but are present in the gold documents. This represents a form of implicit query expansion operating in continuous latent space.
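
A probe of this kind can be sketched as follows; the function and argument names are hypothetical, not part of the released code.

import torch

def logit_lens_tokens(query_mem_hidden, lm_head, tokenizer, top_n=5):
    # Project each query memory token's hidden state through the LM head
    # and list the most likely vocabulary tokens it encodes
    logits = lm_head(query_mem_hidden)           # (num_memory_tokens, vocab_size)
    top_ids = logits.topk(top_n, dim=-1).indices
    return [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in top_ids]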

Gradient Coupling Analysis

The authors provide a rigorous theoretical justification for why learning from NTP produces stronger and more stable learning signals. When reranking and generation share the same representations, p(y|x,d) depends on the retrieval score s_xd, allowing gradients from the generator to flow into the reranker.

The gradient coupling produces two complementary learning signals:

  1. The retriever is rewarded for ranking correct documents higher through probabilistic alignment

  2. It is guided to represent documents in a way that facilitates the generator's reasoning through representation-level feedback

Experimental Results

Experimental Setup

Dataset: The experiments were conducted on four QA benchmarks: NQ (Natural Questions), HotpotQA, MuSiQue, and 2WikiMultihopQA, covering both single-hop and multi-hop reasoning.

Base Models: Mistral-7B and Phi-4B, with models released on Hugging Face: CLaRa-7B-Base, CLaRa-7B-Instruct, and CLaRa-7B-E2E.

Settings: Two evaluation configurations:

  • Normal: top-5 documents retrieved from Wikipedia-2021

  • Oracle: gold document included in the top-5, to isolate the quality of compression from retrieval noise

Compression Performance

Table: Compression Performance Comparison (Normal Setting, Mistral-7B)

| Model | CR | NQ | HotpotQA | MuSiQue | 2Wiki | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral-7B w/ BGE | 1x | 54.58 | 42.94 | 8.94 | 44.24 | 37.67 |
| LLMLingua-2 | 4x | 47.53 | 37.05 | 9.02 | 44.35 | 34.49 |
| PISCO | 16x | 54.39 | 41.94 | 10.09 | 44.88 | 37.83 |
| CLaRa (Ours) | 4x | 57.05 | 45.09 | 10.34 | 46.94 | 39.86 |
| CLaRa (Ours) | 16x | 55.56 | 43.72 | 10.55 | 46.00 | 38.96 |
| CLaRa (Ours) | 32x | 54.64 | 43.52 | 10.55 | 46.58 | 38.82 |

CR = Compression Rate

Compression Results Analysis

The results reveal several significant insights:

  • Surpassing full-text baselines: CLaRa outperforms the Mistral-7B baseline that uses BGE retrieval over uncompressed documents, with an average gain of +2.36%. This suggests that well-trained compressed representations can filter out irrelevant content and focus the generator on the context relevant for reasoning.

  • Robustness across compression rates: Performance remains stable from 4x to 32x, peaking at 16x-32x for most datasets.

  • Gains vs. soft compression SOTA: Compared to PISCO (the best soft compression baseline), CLaRa achieves average gains of +1.13% (Normal) and +5.35% (Oracle).

End-to-End Performance

In the end-to-end evaluation, CLaRa-Mistral-7B with a compression ratio of 16x surpasses DRO-Mistral-7B (the text-based state-of-the-art baseline):

  • NQ: F1 from 51.01 → 51.41

  • 2Wiki: F1 from 43.65 → 47.18

In the Oracle setting, F1 performance exceeds 75% on both NQ and HotpotQA, demonstrating that joint optimization effectively leverages accurate retrieval.

Retrieval Performance

A particularly surprising result concerns retrieval performance. CLaRa, trained only with weak supervision from the generation loss (without explicit relevance labels), outperforms supervised retrievers trained with ground-truth labels.

On HotpotQA with a compression ratio of 4x:

  • CLaRa Recall@5: 96.21%

  • BGE-Reranker Recall@5: 85.93%

  • Gain: +10.28%

This demonstrates that weak supervision from generation is sufficient to learn deep semantic correlations between queries and documents.

Technical Implementation Details

Technology Stack

  • Framework: Built on OpenRLHF, an open-source framework for RLHF

  • Core Dependencies: PyTorch >= 2.0, Transformers >= 4.20, DeepSpeed >= 0.18, Flash Attention 2

  • Distributed Training: Support for ZeRO Stage 2, multi-node/multi-GPU training

  • Precision: bfloat16 for memory efficiency

Repository Structure

The GitHub repository (github.com/apple/ml-clara) is organized as follows:

├── scripts/                      # Training and evaluation scripts
│   ├── train_pretraining.sh     # Stage 1
│   ├── train_instruction_tuning.sh  # Stage 2
│   ├── train_stage_end_to_end.sh    # Stage 3
│   └── evaluation_end_to_end.sh
├── openrlhf/
│   ├── models/modeling_clara.py  # Model definition
│   ├── datasets/sft_dataset.py   # Dataset management
│   └── trainer/sft_trainer.py    # Training utilities
├── evaluation/                   # Evaluation framework
└── example/                      # Example data

Data Formats

Pretraining (Stage 1):

{
    "data_type": "qa",
    "question": ["Question 1"],
    "answers": ["Answer 1"],
    "docs": ["Document 1"]
}

End-to-End Training (Stage 3):

{
    "question": "Single question text",
    "docs": ["Document 1", "Document 2", "..."],
    "gold_answer": "Reference answer"
}

Key Training Parameters

| Parameter | Stage 1-2 | Stage 3 | Note |
| --- | --- | --- | --- |
| max_len | 2048 | 1024 | Sequence length |
| learning_rate | 1e-4 | 5e-6 | - |
| compress_rate | 4-256x | 4-128x | Flexible |
| doc_max_length | 256 | 256 | Per document |
| generation_top_k | 5 | 5 | Top-k docs |

Practical Usage Guide

Installation

# Create conda environment
conda create -n clara python=3.10 -y
conda activate clara

# Install dependencies
pip install -r requirements.txt

# Setup path
export PYTHONPATH=/path/to/clara:$PYTHONPATH

Inference with Pre-trained Models

Three models are available on Hugging Face, each for a different use case:

| Model | Use Case |
| --- | --- |
| CLaRa-7B-Base | Base semantic compression |
| CLaRa-7B-Instruct | QA from compressed representations |
| CLaRa-7B-E2E | Joint retrieval + generation |

Example usage of CLaRa-7B-E2E:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "apple/CLaRa-7B-E2E",
    trust_remote_code=True
).to("cuda")

# 20 candidate documents
documents = [[
    "Document 1 content...",
    "Document 2 content...",
    # ... up to 20 documents
]]

questions = ["Your question here"]

# Generate answer with internal retrieval and reranking
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)

print(f"Answer: {output[0]}")
print(f"Selected document indices: {topk_indices}")

Example CLaRa-7B-Instruct (without internal reranking):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "apple/CLaRa-7B-Instruct",
    trust_remote_code=True
).to("cuda")

documents = [[
    "Document 1...",
    "Document 2...",
    "Document 3..."
]]

questions = ["Your question here"]

# Generate answer from already selected documents
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)

print(f"Answer: {output[0]}")

Implications and Future Directions

Impact on Industry

CLaRa represents a significant step towards more efficient and accurate RAG systems. The 16x-128x compression with preserved or improved performance has direct implications for:

  • Reducing inference costs: Fewer tokens to process = less compute

  • RAG on edge devices: Compact representations enable on-device deployment

  • Scaling to larger knowledge bases: Without increasing memory requirements

Recognized Limitations

  • Generalization of the compressor: Currently pre-trained only on Wikipedia; requires adaptation for other domains (code, technical documents, etc.)

  • Model scale: Experiments limited to 7B and 4B; larger models may produce higher quality representations

  • Reasoning on compressed representations: The paper does not explore integration with agentic frameworks or advanced multi-hop RAG

Future Directions

The authors identify several promising directions:

  • Domain-adaptive pretraining with diverse corpora (e.g., code, legal documents)

  • Integration into RAG reasoning-oriented frameworks like Search-R1

  • Extension to tool learning and multimodal systems

  • Linking implicit understanding and implicit reasoning (latent reasoning)

Conclusions

CLaRa represents a fundamental advancement in the architecture of RAG systems. By unifying retrieval and generation in a shared continuous representation space, it simultaneously addresses:

  1. The optimization problem: Enabling end-to-end learning through differentiable selection

  2. The efficiency problem: Eliminating redundant encoding and reducing context length

Empirical results validate this architectural choice: achieving state-of-the-art reranking performance without explicit relevance supervision demonstrates that the quality of generation provides sufficient learning signal for retrieval. The 16x-128x semantic compression with preserved performance suggests that effective reasoning in RAG systems may not depend on long contexts, but rather on a unified latent reasoning space.

For the Italian AI community and companies like Algoretico, CLaRa opens concrete opportunities: from reducing infrastructure costs to enabling more sophisticated RAG applications in enterprise contexts. The open-source code and models available on Hugging Face make this technology immediately accessible for experimentation and deployment.

References

  1. He, J., Bai, R.H., Williamson, S., Pan, J.Z., Jaitly, N., & Zhang, Y. (2025). CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning. arXiv:2511.18659.

  2. GitHub Repository: https://github.com/apple/ml-clara

  3. Hugging Face Models: apple/CLaRa-7B-Base, apple/CLaRa-7B-Instruct, apple/CLaRa-7B-E2E

  4. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

  5. Louis, A., et al. (2025). PISCO: Memory Token Compression for RAG. ACL 2025.
