Why RAG Needs Multiple Retrieval Layers
Part 1: Beyond Vector Search - Why RAG Needs Multiple Retrieval Layers
This is Part 1 of a 4-part series on building production-ready, multi-layer RAG systems with Ragforge.
- Part 1: Beyond Vector Search (you are here)
- Part 2: The Multi-Hop Retrieval Pipeline
- Part 3: Microservices Architecture
- Part 4: From Development to Production
Introduction: The State of RAG in 2025
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications that need to answer questions based on private or specialized knowledge. The basic recipe is simple:
- Store your documents as vector embeddings
- Find semantically similar chunks for a user’s query
- Pass those chunks to an LLM
- Generate an answer
This approach works remarkably well for straightforward factual lookups. Ask “What is photosynthesis?” and a vector search will find relevant biology textbook passages. The LLM synthesizes a clear answer.
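To make the recipe concrete, here is a minimal sketch of that single-layer pipeline in Python. It assumes the sentence-transformers library and a tiny in-memory corpus; the model name and the commented-out LLM call are placeholder choices, not part of any specific framework.

```python
# Minimal single-layer RAG sketch (illustrative only; model name is an example choice).
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Arjuna was one of the greatest warriors in the Kurukshetra war.",
    "Krishna served as Arjuna's charioteer throughout the war.",
    "After Day 14, Bhima stepped forward to protect the Pandava position.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Store documents as vector embeddings (normalised so dot product == cosine similarity)
doc_vectors = model.encode(documents, normalize_embeddings=True)

# 2. Find semantically similar chunks for the user's query
query = "What was Krishna's role in the war?"
query_vector = model.encode(query, normalize_embeddings=True)
scores = doc_vectors @ query_vector
top_chunks = [documents[i] for i in np.argsort(scores)[::-1][:2]]

# 3. + 4. Pass those chunks to an LLM and generate an answer
prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
# answer = your_llm_client.generate(prompt)  # placeholder: any chat/completions API works here
```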
But what happens when queries become more complex?
The Problem: When Vector Search Isn’t Enough
Consider this question about the Mahabharata epic:
“Who helped Arjuna after Day 10 of the war?”
A traditional RAG system using only vector search would:
- Embed the question into a vector
- Find the most semantically similar document chunks
- Pass those chunks to the LLM
- Hope the answer is in there
Why This Fails
This question requires understanding:
- Temporal relationships: Events happening after a specific day
- Social connections: Who was allied with whom
- Causal chains: X happened because Y influenced Z
- Multi-step reasoning: Finding information across multiple related documents
Vector search struggles here because:
1. Semantic similarity ≠ Logical relevance
   - The most similar passages might mention “Arjuna” and “Day 10” but not discuss who helped him afterward
   - A passage about Arjuna’s allies from Day 5 is semantically similar but temporally wrong
2. Timeline information is often implicit
   - “The next day, Krishna advised…” - What day was that?
   - Vector embeddings don’t capture temporal ordering well
3. Relationship context is distributed
   - Arjuna’s allies are mentioned in one document
   - Battle timelines are in another
   - Specific aid after Day 10 is in a third
   - No single chunk contains the complete answer
4. Causal chains require multi-hop reasoning
   - Krishna helped because he was Arjuna’s charioteer
   - Bhima helped because he was Arjuna’s brother
   - This “because” reasoning isn’t captured by cosine similarity
A Real Example of Vector Search Failure
Let’s visualize what happens:
Query: "Who helped Arjuna after Day 10 of the war?"
Query Embedding: [0.23, -0.41, 0.89, ...]

Top 3 Results by Cosine Similarity:
─────────────────────────────────────────────────────
Rank 1 (score: 0.92)
"Arjuna was one of the greatest warriors in the Kurukshetra war.
On Day 10, he fought valiantly against the Kaurava forces..."

Problem: Mentions Arjuna and Day 10, but says nothing about who helped him afterward
─────────────────────────────────────────────────────
Rank 2 (score: 0.89)
"Krishna served as Arjuna's charioteer throughout the war,
providing strategic guidance and moral support..."

Problem: Mentions help but no temporal marker for "after Day 10"
─────────────────────────────────────────────────────
Rank 3 (score: 0.87)
"After Day 14, when Arjuna was exhausted, his brother Bhima
stepped forward to protect the Pandava position..."

Closer! Has a temporal marker and a helper, but it's Day 14, not Day 10
─────────────────────────────────────────────────────

The right answer might be buried at rank 47 because it mentions different names or uses different phrasing.
What We Actually Need
To answer complex questions, we need a system that can:
1. Understand Entity Relationships
Arjuna -[ALLY_OF]-> Krishna
Arjuna -[BROTHER_OF]-> Bhima
Arjuna -[PARTICIPATED_IN]-> Kurukshetra War

When we search for “Arjuna,” the system should know to also look at Krishna and Bhima’s activities.
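With a graph in place, that expansion is a one-hop traversal. A rough sketch using the Neo4j Python driver, where the connection details, the Entity label, and the seed data are illustrative assumptions rather than Ragforge's actual schema:

```python
# One-hop graph expansion around a seed entity (illustrative Neo4j schema and credentials).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (a:Entity {name: $name})-[r:ALLY_OF|BROTHER_OF]-(other:Entity)
RETURN other.name AS related, type(r) AS relation
"""

with driver.session() as session:
    for record in session.run(CYPHER, name="Arjuna"):
        # With the toy graph above this yields Krishna (ALLY_OF) and Bhima (BROTHER_OF)
        print(record["related"], "via", record["relation"])

driver.close()
```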
2. Reason About Time and Sequence
Day 10: [Event A, Event B]
Day 11: [Event C]   ← Search here for "after Day 10"
Day 12: [Event D]

The system needs to filter results based on temporal constraints, not just semantic similarity.
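If each chunk carries a day marker in its metadata, that constraint becomes an ordinary filter applied alongside similarity scoring. A small illustrative sketch; the day field, scores, and chunk texts are assumed:

```python
# Apply the "after Day 10" constraint as a metadata filter on retrieved chunks.
# The `day` key is an assumed metadata convention, not a fixed schema.
chunks = [
    {"text": "On Day 10, Arjuna fought valiantly against the Kaurava forces...", "day": 10, "score": 0.92},
    {"text": "On Day 11, Krishna continued as Arjuna's charioteer...",           "day": 11, "score": 0.81},
    {"text": "On Day 5, Arjuna's allies regrouped...",                           "day": 5,  "score": 0.88},
]

def after_day(candidates, day):
    """Keep only chunks whose events happen strictly after the given day."""
    return [c for c in candidates if c.get("day") is not None and c["day"] > day]

filtered = sorted(after_day(chunks, day=10), key=lambda c: c["score"], reverse=True)
print([c["text"] for c in filtered])  # only the Day 11 chunk survives
```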
3. Traverse Multiple Hops of Information
Hop 1: Find documents about Arjuna (semantic search)
   ↓
Hop 2: Expand to his known allies (graph traversal)
   ↓
Hop 3: Filter for events after Day 10 (temporal filter)
   ↓
Result: "Krishna continued as charioteer, Bhima protected flanks"

4. Combine Multiple Evidence Sources
- Semantic: “Arjuna needed support”
- Symbolic: Krishna is linked to Arjuna via ALLY_OF relationship
- Temporal: Document has metadata {day: 11, after_day_10: true}
- Synthesis: All three signals point to the right answer
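One simple way to combine these signals is a weighted score per candidate chunk, as in the sketch below. The weights and field names are illustrative, not tuned Ragforge values:

```python
# Fuse semantic, symbolic, and temporal evidence into one ranking score per chunk.
# Weights are illustrative; a production system would tune or learn them.
def fused_score(chunk, graph_neighbors, min_day):
    semantic = chunk["similarity"]                                  # cosine similarity from vector search
    symbolic = 1.0 if chunk["entity"] in graph_neighbors else 0.0   # linked to the query entity in the graph?
    temporal = 1.0 if chunk.get("day", -1) > min_day else 0.0       # satisfies "after Day 10"?
    return 0.5 * semantic + 0.3 * symbolic + 0.2 * temporal

graph_neighbors = {"Krishna", "Bhima"}  # e.g. from the ALLY_OF / BROTHER_OF expansion above
chunk = {"entity": "Krishna", "similarity": 0.83, "day": 11}
print(fused_score(chunk, graph_neighbors, min_day=10))  # 0.915
```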
Introducing Multi-Layer RAG
What if we could combine three complementary approaches?
The Three Pillars
                      MULTI-LAYER RAG

 ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
 │   SEMANTIC   │   │   SYMBOLIC   │   │  MULTI-HOP   │
 │              │   │              │   │              │
 │    Vector    │   │  Knowledge   │   │  Iterative   │
 │  Embeddings  │   │    Graph     │   │   Context    │
 │              │   │              │   │  Expansion   │
 │   pgvector   │   │    Neo4j     │   │              │
 └──────────────┘   └──────────────┘   └──────────────┘

1. Semantic Layer (Dense Retrieval)
- What: Traditional vector search using embeddings
- When: First-hop recall, finding semantically similar content
- How: Sentence transformers → pgvector → cosine similarity
- Strength: Great for finding passages about the same topic
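In practice, the semantic layer boils down to a single nearest-neighbour query against pgvector. A hedged sketch, assuming a chunks table with an embedding vector column (the table, column, model, and connection string are illustrative):

```python
# Nearest-neighbour search with pgvector via psycopg (table, column, and DSN are assumed).
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("Who helped Arjuna after Day 10 of the war?").tolist()
vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"  # pgvector text format

with psycopg.connect("dbname=ragforge user=postgres") as conn:
    rows = conn.execute(
        """
        SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM chunks
        ORDER BY embedding <=> %s::vector   -- <=> is pgvector's cosine-distance operator
        LIMIT 5
        """,
        (vec_literal, vec_literal),
    ).fetchall()

for chunk_id, content, similarity in rows:
    print(f"{similarity:.3f}  {content[:80]}")
```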
2. Symbolic Layer (Graph Reasoning)
- What: Knowledge graph with entities and relationships
- When: Understanding connections, hierarchies, and constraints
- How: Entity extraction → graph database → traversal queries
- Strength: Captures relationships that aren’t semantically similar
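Building this layer means merging extracted (subject, relation, object) triples into the graph. A minimal loading sketch; the Entity label, credentials, and the sample triples are illustrative assumptions:

```python
# Merge extracted (subject, relation, object) triples into Neo4j idempotently.
from neo4j import GraphDatabase

triples = [
    ("Arjuna", "ALLY_OF", "Krishna"),
    ("Arjuna", "BROTHER_OF", "Bhima"),
    ("Arjuna", "PARTICIPATED_IN", "Kurukshetra War"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for subject, relation, obj in triples:
        # Relationship types cannot be parameterised in Cypher, so the type is interpolated;
        # acceptable here because `relation` comes from a controlled extraction vocabulary.
        session.run(
            f"MERGE (s:Entity {{name: $s}}) "
            f"MERGE (o:Entity {{name: $o}}) "
            f"MERGE (s)-[:{relation}]->(o)",
            s=subject, o=obj,
        )
driver.close()
```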
3. Multi-Hop Layer (Iterative Expansion)
- What: Progressively expand context through multiple retrieval rounds
- When: Answer requires information from multiple sources
- How: Hop 1 (vector) → Hop 2 (graph) → Hop 3 (exact match)
- Strength: Finds distributed information and synthesizes context
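At the orchestration level, the three hops compose into one loop in which each round expands or narrows the candidate pool before the next runs. A schematic sketch; the three helper callables stand in for the vector, graph, and metadata calls sketched elsewhere in this post and are not Ragforge's actual API:

```python
# Schematic multi-hop retrieval: vector recall -> graph expansion -> exact/metadata filtering.
from typing import Callable

def multi_hop_retrieve(
    query: str,
    vector_search: Callable[[str], list[dict]],
    graph_expand: Callable[[set[str]], set[str]],
    metadata_filter: Callable[[list[dict]], list[dict]],
) -> list[dict]:
    # Hop 1: broad semantic recall over the whole corpus
    candidates = vector_search(query)

    # Hop 2: expand to chunks about entities connected to what Hop 1 found
    seed_entities = {c["entity"] for c in candidates if c.get("entity")}
    related = graph_expand(seed_entities)
    candidates += [c for c in vector_search(" ".join(related)) if c.get("entity") in related]

    # Hop 3: enforce hard constraints (temporal windows, exact matches) on the expanded pool
    return metadata_filter(candidates)
```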
The Ragforge Approach
Ragforge is a production-ready implementation of multi-layer RAG that orchestrates all three layers intelligently. It is built as a set of microservices designed to be straightforward to deploy and operate.
High-Level Architecture
             ┌─────────────┐
             │    Query    │
             └──────┬──────┘
                    │
                    ▼
       ┌────────────────────────┐
       │  Query Understanding   │
       │  (Parse, Extract,      │
       │   Classify Intent)     │
       └────────────┬───────────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
      ▼             ▼             ▼
┌──────────┐  ┌───────────┐  ┌──────────┐
│ Semantic │  │ Ontology  │  │ Temporal │
│Expansion │  │ Expansion │  │ Filtering│
└─────┬────┘  └─────┬─────┘  └────┬─────┘
      │             │             │
      └─────────────┼─────────────┘
                    │
                    ▼
          ┌──────────────────┐
          │    Multi-Hop     │
          │    Retrieval     │
          │                  │
          │  Hop 1: Vector   │
          │  Hop 2: Graph    │
          │  Hop 3: Exact    │
          └─────────┬────────┘
                    │
                    ▼
          ┌──────────────────┐
          │     Fusion &     │
          │    Reranking     │
          └─────────┬────────┘
                    │
                    ▼
          ┌──────────────────┐
          │  LLM Generation  │
          │   + Citations    │
          └──────────────────┘

Key Components
1. Data Pipeline Layer
Transforms raw text into multiple representations:
- Document chunks for vector search (300-500 tokens)
- Entity extractions (characters, locations, events)
- Structured triples for the graph (subject-relation-object)
- Metadata tags (timeline markers, entity types)
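A simplified way to picture those representations is one record per chunk plus one record per triple. The field names below are illustrative, not Ragforge's actual schema:

```python
# Illustrative shapes of the pipeline's outputs (field names are assumptions).
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str                                      # 300-500 token span of the source document
    embedding: list[float]                         # dense vector stored in pgvector
    metadata: dict = field(default_factory=dict)   # e.g. {"day": 11, "entity_types": ["Character"]}

@dataclass
class Triple:
    subject: str           # "Krishna"
    relation: str          # "ALLY_OF"
    obj: str               # "Arjuna"
    source_chunk_id: str   # provenance link back to the chunk it was extracted from

chunk = Chunk("mbh-0042", "On Day 11, Krishna continued as Arjuna's charioteer...", [0.12, -0.08], {"day": 11})
triple = Triple("Krishna", "ALLY_OF", "Arjuna", source_chunk_id="mbh-0042")
```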
2. Ontology / Knowledge Graph Layer
Maintains relationships:
- Nodes: Characters, Houses, Locations, Events, Battles
- Edges: ALLY_OF, PARTICIPATED_IN, LOCATED_AT, HAPPENED_AFTER
- Capabilities: Multi-hop traversal, relationship inference
3. Embeddings & Vector Store Layer
Dense semantic search:
- Sentence-level embeddings
- pgvector for ANN search
- Metadata filtering
- Hybrid search (vector + SQL)
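Hybrid search here means combining the vector ORDER BY with ordinary SQL predicates over the metadata in one query. A sketch under the same assumed chunks table as before, with metadata stored as JSONB:

```python
# Hybrid retrieval: vector ranking plus a plain SQL predicate on JSONB metadata, in one query.
import psycopg

def hybrid_search(conn: psycopg.Connection, vec_literal: str, after_day: int, k: int = 5):
    return conn.execute(
        """
        SELECT id, content, metadata->>'day' AS day
        FROM chunks
        WHERE (metadata->>'day')::int > %s          -- symbolic/temporal constraint
        ORDER BY embedding <=> %s::vector           -- semantic ranking
        LIMIT %s
        """,
        (after_day, vec_literal, k),
    ).fetchall()
```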
4. Retrieval Orchestrator (The Brain)
Coordinates all layers:
- Query parsing and intent detection
- Semantic and ontology expansion
- Multi-hop retrieval workflow
- Fusion and reranking
- Prompt construction
5. LLM Reasoning & Generation Layer
Grounded answer generation:
- Context injection
- Citation tracking
- Reasoning traces
- Safety filtering
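Much of grounded generation comes down to prompt assembly: number the retrieved chunks, instruct the model to cite them, and keep the mapping so citations can be rendered back to sources. A minimal sketch; the prompt wording and field names are illustrative, and any chat-completion client can consume the result:

```python
# Build a grounded prompt with numbered sources so the model's [n] citations can be
# traced back to specific chunks.
def build_grounded_prompt(question: str, chunks: list[dict]) -> tuple[str, dict[int, str]]:
    citation_map = {i + 1: c["chunk_id"] for i, c in enumerate(chunks)}
    context = "\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using ONLY the numbered sources below.\n"
        "Cite every claim with its source number, e.g. [2].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, citation_map

prompt, citations = build_grounded_prompt(
    "Who helped Arjuna after Day 10 of the war?",
    [{"chunk_id": "mbh-0042", "text": "On Day 11, Krishna continued as Arjuna's charioteer..."}],
)
# response = llm_client.chat(prompt)  # placeholder: plug in whichever LLM client you deploy with
```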
6. UI & API Gateway Layer
User interaction:
- REST API
- Interactive UI
- Debug views
- Visualization panels
When Do You Need Multi-Layer RAG?
Not every application needs this complexity. Here’s a decision framework:
✅ You NEED Multi-Layer RAG When:
- Complex relationships matter: “Who influenced whom?”
- Temporal reasoning is required: “What happened after X?”
- Causal chains are important: “Why did X lead to Y?”
- Multi-document synthesis is needed: Answer spans multiple sources
- Domain has structured knowledge: Medical ontologies, legal precedents, enterprise taxonomies
- Accuracy is critical: Cost of wrong answers is high
Examples: Medical diagnosis support, legal case analysis, historical research, enterprise knowledge bases, technical troubleshooting
❌ You DON’T Need Multi-Layer RAG When:
- Simple factual lookups: “What is X?”
- Single-document answers: Everything is in one place
- No relationships matter: Documents are independent
- Speed over accuracy: Quick responses more important than perfect answers
- No graph structure: Your domain doesn’t have clear entities and relationships
Examples: FAQ chatbots, simple documentation search, product catalog search, blog content retrieval
Cost/Benefit Analysis
                     Complexity vs. Value

  High │                              ● Multi-Layer RAG
   V   │                                (High complexity,
   a   │                                 high value)
   l   │
   u   │              ● Hybrid RAG
   e   │                (Medium complexity,
       │                 medium value)
       │
       │  ● Simple Vector RAG
  Low  │    (Low complexity, low value)
       └────────────────────────────────────────────
          Low             Medium              High
                        Complexity

What’s Coming Next
In Part 2, we’ll dive deep into the retrieval engine and show you exactly how multi-hop retrieval works:
- Complete query flow walkthrough with the Arjuna example
- How semantic expansion works (with code)
- How ontology expansion leverages graph traversal
- The multi-hop retrieval algorithm step-by-step
- Fusion and reranking strategies
- Performance characteristics and benchmarks
You’ll see the actual queries, graph traversals, and fusion logic that makes multi-layer RAG work.
Key Takeaways
1. Vector search alone is insufficient for complex queries requiring relationships, temporal reasoning, or causal chains
2. Multi-layer RAG combines three approaches: semantic (vector), symbolic (graph), and multi-hop (iterative expansion)
3. Ragforge implements this pattern as a production-ready, scalable system
4. Not every application needs this complexity - use the decision framework to determine if multi-layer RAG is right for you
5. The payoff is accuracy and explainability - you can trace exactly where answers come from and why they’re relevant
Continue the Series
Part 1 (you are here) → Part 2: The Multi-Hop Retrieval Pipeline
Coming Up Next: We’ll trace a complete query from start to finish, showing you exactly how the retrieval orchestrator coordinates semantic search, graph traversal, and multi-hop expansion to answer complex questions.