RAG Architecture for Enterprise Data: Retrieval-Augmented Generation Explained

How Retrieval-Augmented Generation works, the data infrastructure requirements for enterprise RAG systems — vector stores, chunking strategy, embedding models, hybrid search — and how your existing data architecture affects RAG performance.

Retrieval-Augmented Generation (RAG) is the dominant architecture for enterprise AI applications that need to answer questions from an organisation's own data. Large language models are knowledgeable but their knowledge is frozen at training time and does not include your specific data — your contracts, your documentation, your product records, your customer history. RAG solves this by retrieving relevant context from your data at query time and providing it to the language model as input.

Understanding RAG architecture matters for data teams because RAG's performance is determined primarily by data quality, data structure, and retrieval system design — not by the language model itself. The best language model in the world produces poor results if the retrieval layer returns irrelevant or incomplete context.

How RAG Works

The RAG architecture has three main components:

**Ingestion pipeline.** Source documents are processed into a vector store. Processing involves: chunking (splitting documents into segments), embedding (converting each chunk into a vector representation using an embedding model), and indexing (storing the vectors in a vector database alongside the source text).

**Retrieval.** When a user asks a question, the query is embedded using the same embedding model, and the vector database performs a nearest-neighbour search to find the chunks whose vector representation is most semantically similar to the question. The top K results (typically 3–10 chunks) are returned as context.

**Generation.** The retrieved chunks are assembled into a prompt alongside the user's question and sent to the language model. The model generates an answer using both its parametric knowledge (what it learned during training) and the retrieved context. The answer is grounded in the retrieved context, reducing hallucination compared to generation without retrieval.

The quality of RAG output depends on three things: whether the relevant information exists in the corpus, whether the retrieval step returns the relevant chunks (not merely chunks that are semantically similar but irrelevant), and whether the language model synthesises that context into a correct answer.
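The three stages can be sketched end to end in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the "vector store" is a Python list, and the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical. The final prompt would be sent to whichever language model you use.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion: embed each chunk and store the vector alongside the source text."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 3) -> list[str]:
    """Retrieval: embed the question with the same model and return the top-k chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Generation input: assemble retrieved chunks and the question into one prompt."""
    joined = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

A real system swaps `embed` for an embedding model and the list for a vector database, but the flow of data between the stages stays the same.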

Data Infrastructure Requirements

**Vector databases.** A vector database stores high-dimensional vectors and supports efficient approximate nearest-neighbour (ANN) search. Options range from purpose-built vector databases (Pinecone, Weaviate, Qdrant, Chroma) to vector search capabilities in general databases (PostgreSQL with pgvector, MongoDB Atlas Vector Search, Redis Vector Search) to cloud warehouse native vector support (Snowflake with VECTOR type, BigQuery with VECTOR_SEARCH).

For organisations with an existing data warehouse on Snowflake or BigQuery, native vector support can remove the need for a separate vector database: embeddings are stored alongside existing data, and retrieval queries run in the same system as other analytical queries. This simplifies the architecture and avoids data synchronisation between systems.

**Embedding models.** Embedding models convert text into vectors. Model selection affects retrieval quality and cost. Common choices:

- **OpenAI text-embedding-3-large/small**: strong general-purpose embeddings, API-based, per-token pricing

- **Cohere embed-v3**: multilingual support, strong performance, API-based

- **Snowflake Arctic Embed**: optimised for retrieval tasks, available via Snowflake Cortex

- **Open-source models (BGE, E5, nomic-embed-text)**: self-hostable, no per-token cost, competitive performance

Embedding dimensionality matters: higher-dimensional embeddings (1536d, 3072d) capture more semantic nuance but require more storage and make ANN search slower. For most enterprise use cases, 1536d embeddings (the default output size of text-embedding-3-small) provide a strong tradeoff.
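The storage side of that tradeoff is simple arithmetic. The sketch below assumes 32-bit floats (4 bytes per value) and counts raw vectors only; ANN index structures, stored text, and metadata add overhead on top. The function name `raw_vector_gib` is hypothetical.

```python
def raw_vector_gib(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GiB, assuming float32 values by default.

    Excludes ANN index overhead, source text, and metadata.
    """
    return num_chunks * dims * bytes_per_value / 2**30

# One million chunks at the two dimensionalities mentioned above.
small = raw_vector_gib(1_000_000, 1536)
large = raw_vector_gib(1_000_000, 3072)
print(f"1536d: {small:.2f} GiB, 3072d: {large:.2f} GiB")
```

Doubling the dimensionality doubles raw storage, which is why the larger model is worth choosing only when retrieval quality measurably improves on your own corpus.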

Chunking Strategy

Chunking — how you split documents before embedding — is one of the most impactful decisions in RAG architecture and one of the least discussed. Poor chunking causes retrieval failure even when the relevant information exists in the corpus.

**Fixed-size chunking**: split documents into chunks of N tokens (e.g., 512 tokens) with M tokens of overlap between chunks. Simple to implement. Works poorly when sentences or paragraphs span chunk boundaries, splitting relevant content across two chunks.
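A minimal sketch of fixed-size chunking with overlap. It assumes the document has already been tokenised into a list; in practice you would use the tokenizer that matches your embedding model. The function name `chunk_fixed` and the default overlap of 64 are illustrative choices.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap softens the boundary problem by repeating the last few tokens of each chunk at the start of the next, but it cannot guarantee that a sentence or paragraph stays whole.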

**Semantic chunking**: split at sentence or paragraph boundaries, grouping semantically related sentences together. Produces more coherent chunks but requires more sophisticated splitting logic. Tools like LangChain, LlamaIndex, and semantic-chunker implement this.

**Document-structure-aware chunking**: for structured documents (contracts, product documentation, support articles), split at document structure boundaries — headings, sections, articles, clauses. Each chunk represents a logical document unit. Retrieval returns the relevant section, not an arbitrary 512-token slice.
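As an illustration, the sketch below splits on headings. It assumes the source has been converted to text with markdown-style `#` headings, which is an assumption made for this example; contracts or HTML pages would need their own boundary detection (clause numbers, tags). The function name `chunk_by_heading` is hypothetical.

```python
import re

# Assumption: headings look like "# Title" up to "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading(text: str) -> list[dict]:
    """Return one chunk per heading-delimited section, keeping the heading as a title."""
    chunks = []
    title, body = "(preamble)", []

    def flush() -> None:
        # Only emit a chunk when the section has non-blank content.
        if any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the section title with each chunk also gives you a natural metadata field to store at index time.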

**Hierarchical chunking (parent-child)**: store small chunks for retrieval (higher precision) but return the larger parent chunk as context (more complete information). The retrieval finds the specific relevant passage; the context includes the surrounding section for completeness.
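A toy sketch of the parent-child pattern. Sentences act as child chunks and whole sections as parents. Word overlap stands in for vector similarity so the example runs without a model, and all names (`build_children`, `retrieve_parents`) are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter used to create the small child chunks."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_children(parents: dict[str, str]) -> list[tuple[str, str]]:
    """Index time: each child keeps a pointer (parent ID) back to its section."""
    return [(pid, s) for pid, text in parents.items() for s in split_sentences(text)]

def retrieve_parents(parents: dict[str, str], children: list[tuple[str, str]],
                     question: str, k: int = 2) -> list[str]:
    """Query time: rank the small children, then return their de-duplicated parents."""
    q = words(question)
    # Word overlap is a stand-in for embedding similarity in this sketch.
    ranked = sorted(children, key=lambda child: len(q & words(child[1])), reverse=True)
    seen, context = set(), []
    for pid, _ in ranked[:k]:
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context
```

De-duplicating parents matters because several top-ranked children often belong to the same section, and sending that section twice wastes context window.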

For enterprise document corpora with structure (product documentation, contracts, technical specifications), document-structure-aware chunking consistently outperforms fixed-size chunking. For unstructured text (emails, support conversations, freeform notes), semantic chunking typically outperforms fixed-size.

Hybrid Search

Pure vector search (semantic similarity) is not sufficient for all retrieval tasks. Questions that contain specific product codes, contract clause numbers, or exact names often fail under pure vector search because embeddings capture meaning rather than exact strings.

**Hybrid search** combines vector search (semantic similarity) with keyword search (BM25 or full-text search), then merges and reranks the two result lists. The typical implementation:

- Run vector search against the embedding index → top K results

- Run BM25/full-text search against the text corpus → top K results

- Merge and rerank using a reranking model or reciprocal rank fusion

Reranking models (Cohere Rerank, BGE-reranker, and other cross-encoder models) take a query and a candidate result and score how well the result answers the query. Reranking the merged candidate list from hybrid search typically outperforms either pure vector or pure keyword search, because each method recovers results the other misses.
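Reciprocal rank fusion is the simpler of the two merge options because it needs only the rank positions from each search, not comparable scores. A minimal sketch follows; the function name is hypothetical, and `k=60` is the constant commonly used with this method rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["a", "b", "c"]   # top results from the embedding index
keyword_hits = ["b", "d", "a"]  # top results from BM25 / full-text search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear in both lists rise to the top, which is the behaviour you want for queries that mix a concept with an exact identifier. A reranking model can then be applied to the first few fused results if latency allows.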

The Impact of Your Existing Data Architecture

RAG performance is highly sensitive to the same data quality issues that affect other analytical systems:

**Duplicate content.** If the same information appears in multiple documents with slight variations (multiple versions of a product specification, superseded contracts), retrieval returns all versions, and the language model generates confused or contradictory answers. Deduplication and version management in the ingestion pipeline are critical.
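One way to handle this at ingestion is to keep only the latest version of each document and then drop exact copies. The sketch below assumes each document carries hypothetical `doc_key`, `version`, and `text` fields; near-duplicates with genuinely different wording would need fuzzy matching, which is not shown here.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Hash of whitespace- and case-normalised text, used to spot exact copies."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def select_for_indexing(docs: list[dict]) -> list[dict]:
    """Keep the highest version per doc_key, then drop documents with identical text."""
    latest: dict[str, dict] = {}
    for doc in docs:
        current = latest.get(doc["doc_key"])
        if current is None or doc["version"] > current["version"]:
            latest[doc["doc_key"]] = doc

    seen, kept = set(), []
    for doc in latest.values():
        fp = fingerprint(doc["text"])
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept
```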

**Data freshness.** RAG corpora become stale when source documents are updated but the vector store is not re-indexed. Implement incremental indexing (detect changed documents, re-chunk and re-embed them) rather than nightly full re-indexing for large corpora.
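Change detection for incremental indexing can be as simple as comparing content hashes. The sketch below assumes you persist a mapping of document ID to the hash that was last indexed; `plan_reindex` and its argument names are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed_hashes: dict[str, str],
                 current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare stored hashes with current content.

    Returns (to_upsert, to_delete): IDs that are new or changed and must be
    re-chunked and re-embedded, and IDs that no longer exist in the source.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))
    return to_upsert, to_delete
```

Handling deletions explicitly matters as much as handling updates: a removed document whose chunks stay in the vector store keeps surfacing in answers.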

**Structured data alongside unstructured.** Many enterprise questions require combining unstructured document retrieval with structured database queries ("What is the contract value for the customer with the highest support ticket volume?"). Architectures that combine RAG (for document retrieval) with Text-to-SQL (for structured data queries) — sometimes called agentic RAG — handle these queries but require careful orchestration.

**Metadata filtering.** Adding metadata to chunks at index time (document type, customer ID, date range, product line) enables pre-filtering before vector search — retrieving only from the relevant subset of documents. This dramatically improves precision for multi-tenant or domain-specific applications.
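The effect of pre-filtering is easy to see in a small in-memory sketch. It assumes each chunk is a dict with hypothetical `id`, `vec`, and `meta` fields; a real vector database applies the same idea through its own filter syntax.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks: list[dict], query_vec: list[float],
                    filters: dict[str, str], k: int = 3) -> list[dict]:
    """Restrict candidates by exact-match metadata first, then rank by similarity."""
    candidates = [
        chunk for chunk in chunks
        if all(chunk["meta"].get(field) == value for field, value in filters.items())
    ]
    candidates.sort(key=lambda chunk: cosine(query_vec, chunk["vec"]), reverse=True)
    return candidates[:k]
```

In a multi-tenant application the filter is also a safety boundary: a chunk belonging to another customer can never be returned, however similar its vector is to the query.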

Evaluation and Quality Measurement

RAG quality should be measured systematically, not eyeballed. The standard evaluation framework:

**Faithfulness**: does the generated answer accurately reflect the retrieved context? High faithfulness means the model is not hallucinating beyond its context.

**Answer relevancy**: does the answer actually address the question asked?

**Context precision**: of the chunks retrieved, what fraction was actually relevant to the question?

**Context recall**: did the retrieval step return all of the relevant information in the corpus?

Frameworks like RAGAS and DeepEval, along with the evaluation modules in LlamaIndex, automate these measurements, enabling regression testing when you change chunking strategy, embedding model, or retrieval parameters.
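The two retrieval metrics can be computed by hand once you have a labelled test set. The sketch below uses the plain set-based definitions given above and assumes you know, for each question, which chunk IDs are relevant. Evaluation frameworks may define these metrics differently (for example with rank weighting or LLM-based judgements), so treat this as an illustration of the idea rather than a reimplementation of any of them.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the question."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that the retrieval step returned."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c3", "c9"}          # hand-labelled ground truth
print(context_precision(retrieved, relevant), context_recall(retrieved, relevant))
```

Tracking both numbers across a fixed question set shows whether a change to chunking or retrieval parameters trades precision for recall or improves both.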

For data architecture that supports AI applications, including RAG system design, our data architecture consulting team advises on AI-ready data infrastructure. See also our post on AI-ready data infrastructure, or contact us to discuss your requirements.
