Session 02 · Sunday, May 4 · 7:00 PM · Tonight

Context, Memory & Why AI Forgets

Session 1 ended by planting a question: how do companies build AI that actually remembers things? This session answers it completely - RAG, vector databases, memory patterns, and the real cost of giving AI a memory.

Duration: ~2.5 hours (can split)

Format: Read → Try → Reflect

Pre-reading: Session 01 - How LLMs Actually Work

Course: AI for Enterprises

🔗

Where We Left Off

5 min

Session 1 ended with a planted question: “If every new chat starts from zero, how do companies build AI systems that actually remember things?” That question is this session's entire agenda - and by the end, you'll have the complete answer.

Here is what Session 1 established: models are stateless, weights are frozen at training, and the context window is the only “memory” a model has for any given request. When you close the tab, everything disappears. The model doesn't carry forward a single word.

The pre-session assignment asked you to map a repetitive task to an AI opportunity. That task is your working example for this entire session - the documents it would need, the context it would require, the memory it would have to persist. Keep it in mind.

What we established

Models are stateless next-token predictors. Weights are frozen. Context window is working memory. Chat history is text re-injected into every new call - not recollection.

The question this raises

If the model is stateless and context resets every call, how does any AI product appear to “know” you, retain documents, or have access to company data? That gap is an engineering problem - not a model capability.

What this session covers

The three ways engineers give AI memory. RAG mechanics end-to-end. Vector databases and embeddings. Memory design patterns for enterprise products. The real cost of context injection.

Why it matters for your work

Every enterprise AI product that uses your company's data must solve the memory problem. Understanding how gives you the ability to design solutions - not just use ones others have built.

🔄

Why AI Has No Memory

8 min

Statelessness is not a bug in LLMs. It is the design. Every API call is completely independent. No knowledge, no context, no history flows between sessions unless you explicitly engineer it to do so.

Recall the frozen weights point from Session 1: at inference time, the model's parameters don't change. Nothing about your conversation affects the underlying model. The weights that existed when you started are identical when you finish.

There's a critical distinction most people miss: “the model doesn't remember” is a model property. “The system doesn't persist” is an engineering choice. You cannot change the first. You absolutely can change the second - and that's what the rest of this session is about.

When you use a product that “remembers” your name, your preferences, or your past questions - that product is injecting stored context into the current prompt. The model isn't recalling anything. A database retrieved your profile, and it was pasted into the context window before your message was processed. What looks like memory is text injection.
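
A minimal sketch of what "memory as text injection" can look like, assuming a hypothetical in-memory profile store and a generic system/user message format. The names `storedProfiles` and `buildMessages` are illustrative, not any vendor's API.

```javascript
// Hypothetical stored profile; in a real product this would come from a database.
const storedProfiles = new Map([
  ["user-42", { name: "Sarah", role: "Procurement Lead", preference: "formal tone" }],
]);

// Build the messages for one stateless call. The "memory" is just text
// placed in the prompt before the user's message.
function buildMessages(userId, userMessage) {
  const profile = storedProfiles.get(userId);
  let profileBlock = "User profile: (none on record)";
  if (profile) {
    const lines = Object.entries(profile).map(([key, value]) => `- ${key}: ${value}`);
    profileBlock = "User profile:\n" + lines.join("\n");
  }
  return [
    { role: "system", content: "You are a helpful assistant.\n\n" + profileBlock },
    { role: "user", content: userMessage },
  ];
}

const messages = buildMessages("user-42", "Draft a reminder to our vendors.");
console.log(messages[0].content);
```

The model receives the profile as ordinary prompt text on every call; remove the lookup and the "memory" is gone.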

Myth

“Claude knows who I am - after we've talked before, it remembers me.”

Reality

Stateless by design. Products that appear to know you are injecting stored context from a database into your current session. The model has never seen you before.

Myth

“If I correct an AI's mistake, it learns from that correction.”

Reality

The correction exists only in the current context window. Start a new conversation, ask the same question - same mistake is possible. Nothing was learned; nothing was updated.

Myth

“AI tools get better the more your team uses them.”

Reality

Only if you are actively fine-tuning the model or updating a knowledge base. Passive use alone does not improve the model. Volume without feedback loops is just cost.

Myth

“Longer conversations mean AI understands you better.”

Reality

Only within the current context window. Once a conversation outgrows the window, the earliest turns get truncated. Start a fresh conversation and you are back to zero.

The engineering distinction that changes everything

“The model forgets” is a model property. “The system doesn't persist” is an engineering choice. You can't change the model. You can change the system.

📐

The Context Window - Your Working Memory

8 min

Session 1 introduced the context window. This session goes deeper - because understanding how it fills, what gets dropped, and why position matters is the foundation of every memory engineering decision you'll make.

Think of the context window as working memory for one request. Everything the model knows, for this specific call, must live inside it. What doesn't fit simply does not exist as far as the model is concerned. If your application trims the prompt to make it fit, the model gives no warning and raises no error: the dropped text is simply gone.

Session 2 context anatomy - a typical RAG-powered production call (~200K tokens, Claude 3.5 Sonnet)

System prompt: 15%

User profile/memory: 10%

Retrieved documents (RAG): 25%

Conversation history: 30%

User message: 8%

Remaining space: 12%

What fills the context and in what order

System prompt first (static, set by developer). Then injected context - user profile, retrieved documents from RAG. Then conversation history (all previous turns, oldest first). Then the current user message. The model sees all of this simultaneously.

What gets dropped when it overflows

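What happens depends on the layer. Raw model APIs generally reject a request that exceeds the context limit, so chat products and orchestration frameworks trim the prompt before sending it - typically the beginning of the conversation history first, then the beginning of the earliest injected document. The system prompt is usually protected. The model never warns you about what was trimmed. You simply lose context.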
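
A minimal sketch of budget-based trimming, assuming the application does the trimming before the call. Token counts are estimated at roughly 0.75 words per token (the ratio implied by this session's own figures), which is an approximation and not a real tokenizer. `fitToBudget` is a hypothetical helper name.

```javascript
// Rough token estimate: ~0.75 words per token (approximation, not a tokenizer).
const estimateTokens = (text) =>
  Math.ceil(text.trim().split(/\s+/).filter(Boolean).length / 0.75);

// Drop the oldest history turns until everything fits the budget.
// The system prompt and the current user message are never dropped.
function fitToBudget(systemPrompt, history, userMessage, maxTokens) {
  const fixed = estimateTokens(systemPrompt) + estimateTokens(userMessage);
  const kept = [...history];
  const dropped = [];
  const total = () => fixed + kept.reduce((sum, turn) => sum + estimateTokens(turn), 0);
  while (kept.length > 0 && total() > maxTokens) {
    dropped.push(kept.shift()); // oldest turn goes first
  }
  return { kept, dropped, tokens: total() };
}

const result = fitToBudget(
  "You are a policy assistant.",
  ["I prefer formal tone.", "Noted.", "What is our travel policy?", "Here is a summary."],
  "And the refund policy?",
  25
);
console.log(result.dropped); // log what was lost instead of losing it silently
```

Returning the dropped turns makes truncation visible in logs, which is the point: the model itself will never report it.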

Why position matters (primacy/recency)

Models pay disproportionate attention to content near the start and end of the context window. This is called primacy and recency bias. Critical instructions belong in the system prompt or close to the user message - not buried in the middle of a large document.

How to audit your own context window usage

Log the full prompt you're sending on each call. Measure token counts per segment. Track how usage grows with conversation length. Most teams don't do this until the first surprising invoice or the first truncation-caused failure.
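
The audit described above can be sketched as follows, assuming the prompt is already split into named segments and using the same rough 0.75-words-per-token estimate (an approximation; a production audit would use the provider's tokenizer). `auditContext` and the segment names are illustrative.

```javascript
// Rough token estimate: ~0.75 words per token (approximation, not a tokenizer).
const estimateTokens = (text) =>
  Math.ceil(text.trim().split(/\s+/).filter(Boolean).length / 0.75);

// Report estimated tokens and share of the total for each prompt segment.
function auditContext(segments) {
  const rows = Object.entries(segments).map(([name, text]) => ({
    name,
    tokens: estimateTokens(text),
  }));
  const total = rows.reduce((sum, row) => sum + row.tokens, 0);
  return rows.map((row) => ({ ...row, share: total === 0 ? 0 : row.tokens / total }));
}

const report = auditContext({
  systemPrompt: "You are a policy assistant for Acme Corp.",
  retrievedDocs: "Enterprise contract refunds are processed within 14 business days of written cancellation request.",
  history: "User asked about travel policy. Assistant summarised it.",
  userMessage: "What is the refund policy for enterprise contracts?",
});
for (const row of report) {
  console.log(`${row.name}: ${row.tokens} tokens (${(row.share * 100).toFixed(1)}%)`);
}
```

Logging this per call, and plotting it against conversation length, shows which segment is growing before the invoice does.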

128K

GPT-4 Turbo context window

Roughly 96,000 words - about the length of one long novel. Sounds huge until you're injecting large documents plus conversation history.

200K

Claude Sonnet context window

~150,000 words. Enables full-document reasoning, but larger windows cost proportionally more per call.

1M

Gemini 1.5 context window

Experimental ultra-long context. Answer quality can degrade in the middle of extremely long contexts - the "lost in the middle" problem.

Bigger is not always better

A million-token context window sounds like it eliminates the memory problem. It doesn't. You still pay per token. You still get primacy/recency bias. You still need to curate what goes in. Bigger windows reduce the urgency of good context management - they don't replace it.

🧱

Three Ways to Give AI Memory

10 min

There are exactly three approaches to making an AI system retain information across a session, across sessions, or across all users. They differ in cost, complexity, freshness, and what kind of knowledge they handle well. Understanding all three - and when each is right - is the core design skill this session builds.

Simplest

Layer 1: In-Context Injection

Paste the relevant data directly into every prompt. The model sees it because it's in the context window - not because it remembers it.

Works for: static facts, user preferences, small documents

Fails when: data changes, volume exceeds context window

Cost: you pay for injected tokens on every single call

↓

Enterprise Standard

Layer 2: Retrieval Augmented Generation (RAG)

Store documents as embeddings. At query time, retrieve only the most relevant chunks and inject them. The model sees relevant context without you injecting everything.

Works for: large knowledge bases, proprietary data, frequently changing documents

Fails when: you need perfect recall, data is highly structured (use a DB instead)

Cost: one embedding call per document at ingest, cheap vector search at query time

↓

Most Powerful

Layer 3: Fine-tuning

Update the model's weights with your domain data. The knowledge becomes baked into the model itself - not injected at runtime.

Works for: stable domain knowledge, consistent task patterns, specialised vocabulary

Fails when: data changes frequently - model goes stale immediately after training

Cost: training compute (high) + evaluation + retraining cadence

The order most teams get backwards

Most enterprise teams reach for fine-tuning first and RAG second. This is backwards. RAG is faster to deploy, cheaper to run, keeps data fresh, and can be audited - you can see exactly what was retrieved and why. Fine-tune only when RAG genuinely can't solve it.

📅

The Knowledge Cutoff Problem - and Its Solution

6 min

Session 1 introduced the knowledge cutoff. Now that you understand the three memory approaches, you can see clearly why it matters and how to solve it.

The cutoff means the model genuinely does not know about anything that happened after its training data was collected. But the more practical problem for most enterprises isn't news events - it's proprietary data. Your internal documents, last month's pricing, updated policies, customer records: the model has never seen any of it.

The naive solution - paste the document into the prompt - works for one document, one question. It fails immediately when you have hundreds of documents, when the right document isn't obvious, or when the document is longer than available context. RAG is the scalable solution.

What the model genuinely doesn't know

Your internal documents and policies. Your current pricing. Post-cutoff news and regulations. Anything proprietary or non-public. Internal processes and institutional knowledge. Client histories and account data.

What RAG gives it

Current knowledge - as fresh as your last document ingestion. Specific documents retrieved on demand. Proprietary data without retraining the model. Real-time injected context with source attribution.

Practical rule

If the information is proprietary, recent, or specific to your organisation, it must be injected. Don't rely on the model's training data for anything your business depends on. Treat the model as a reasoning engine, not a source of truth.

⚙️

RAG: How Retrieval Actually Works

12 min

RAG - Retrieval Augmented Generation - is the standard enterprise architecture for giving AI access to proprietary or current information without retraining. Here is the complete pipeline, step by step.

Document ingestion

You upload your documents - PDFs, pages, data, anything text-based. They get split into chunks: paragraph-sized pieces, roughly 256–512 tokens each. Chunking is an art. Too small loses context. Too large dilutes relevance. Each chunk will be retrieved independently, so it needs to be self-contained enough to be useful alone.
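
One simple way to produce paragraph-sized chunks is to pack whole paragraphs up to a token limit. This is a sketch under stated assumptions: paragraphs are separated by blank lines, tokens are estimated at ~0.75 words per token, and `chunkByParagraph` is a hypothetical helper, not a library function.

```javascript
// Rough token estimate: ~0.75 words per token (approximation, not a tokenizer).
const estimateTokens = (text) =>
  Math.ceil(text.trim().split(/\s+/).filter(Boolean).length / 0.75);

// Pack whole paragraphs into chunks without exceeding maxTokens per chunk.
// A single paragraph larger than maxTokens becomes its own oversized chunk.
function chunkByParagraph(documentText, maxTokens = 512) {
  const paragraphs = documentText
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  const chunks = [];
  let current = [];
  let currentTokens = 0;
  for (const paragraph of paragraphs) {
    const size = estimateTokens(paragraph);
    if (current.length > 0 && currentTokens + size > maxTokens) {
      chunks.push(current.join("\n\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(paragraph);
    currentTokens += size;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}

const doc =
  "Refunds are processed within 14 business days.\n\n" +
  "Setup fees are non-refundable.\n\n" +
  "Contact your account manager to cancel.";
console.log(chunkByParagraph(doc, 12));
```

Never splitting inside a paragraph keeps each chunk readable on its own, at the price of uneven chunk sizes.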

Embedding

Each chunk gets converted into a vector - a list of numbers (typically 768–1,536 dimensions) that encodes its meaning. This is done by an embedding model, separate from the generative model. OpenAI, Google, and Cohere all offer embedding APIs. The result: each chunk becomes a point in high-dimensional space, where similar meaning = nearby location.

Storage in vector database

The embedding vectors and original text chunks are stored together in a vector database. The vector is for searching (fast similarity computation). The text is for retrieving (injected into the prompt). This is your knowledge base - persistent, searchable, updatable without retraining anything.

Query time - user asks a question

The user's question is also embedded using the same embedding model. Now you have a query vector - a point in the same high-dimensional space as your document chunks. Finding relevant documents means finding nearby points.

Similarity search

The vector database compares the query vector against the stored document vectors - in practice through an approximate nearest-neighbour index rather than an exhaustive scan - and returns the 3–10 most similar chunks: the ones whose vectors point in similar directions to the query vector. Similar direction means similar meaning. This happens in milliseconds, even across millions of documents.
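
A minimal sketch of similarity search, assuming a tiny in-memory store and an exhaustive scan. Production vector databases use approximate indexes instead, but the ranking idea is the same. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exhaustive search: score every stored chunk, sort by score, keep the top k.
function topK(queryVector, store, k) {
  return store
    .map((entry) => ({ ...entry, score: cosineSimilarity(queryVector, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors, invented for this example.
const store = [
  { text: "Annual leave and PTO guidelines", vector: [0.9, 0.1, 0.0] },
  { text: "Password requirements", vector: [0.0, 0.2, 0.9] },
  { text: "Time off and absence management", vector: [0.8, 0.3, 0.1] },
];
const hits = topK([1.0, 0.2, 0.0], store, 2);
console.log(hits.map((hit) => hit.text));
```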

Retrieval and injection

The top-k most similar chunks are retrieved as text and injected into the prompt as context: "Based on the following documents: [retrieved chunks]. Answer the question: [user query]." The model now has access to the relevant information for this specific request.

Generation

The model generates a response grounded in the retrieved context - not just its training data. If the context is good, the response is accurate and specific. The source chunks can be cited in the response, enabling auditability.

What the prompt looks like - before and after RAG injection

✗

WITHOUT RAG - the model can only draw on its training knowledge

System: You are a policy assistant for Acme Corp.
User: What is the refund policy for enterprise contracts?

→ Model answers from training data. Generic. Possibly wrong.

✓

WITH RAG - retrieved chunk injected into context

System: You are a policy assistant for Acme Corp.

Context [from Policy Doc v3.2, updated 2026-03-01]:
"Enterprise contract refunds are processed within 14 business
days of written cancellation request. Refunds apply pro-rata
to unused months. Setup fees are non-refundable..."

User: What is the refund policy for enterprise contracts?

→ Model answers from the actual, current policy document.
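
The injected prompt above can be assembled with a few lines of string building. This is a sketch that assumes each retrieved chunk carries `source`, `updated`, and `text` fields; those field names and `buildRagPrompt` are hypothetical.

```javascript
// Assemble a RAG prompt from retrieved chunks, keeping source labels for citation.
function buildRagPrompt(systemPrompt, chunks, question) {
  const context = chunks
    .map((chunk) => `Context [from ${chunk.source}, updated ${chunk.updated}]:\n"${chunk.text}"`)
    .join("\n\n");
  return `System: ${systemPrompt}\n\n${context}\n\nUser: ${question}`;
}

const prompt = buildRagPrompt(
  "You are a policy assistant for Acme Corp.",
  [
    {
      source: "Policy Doc v3.2",
      updated: "2026-03-01",
      text: "Enterprise contract refunds are processed within 14 business days of written cancellation request.",
    },
  ],
  "What is the refund policy for enterprise contracts?"
);
console.log(prompt);
```

Keeping the source label next to each chunk is what lets the response cite where an answer came from.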

The key insight

RAG doesn't make the model smarter. It makes the model's context richer. The model is still doing the same thing - predicting the best next token. RAG just ensures the most relevant information is in the window when it does. The intelligence is in the retrieval; the generation is the same as always.

🧮

Embeddings and Similarity Search

8 min

Session 1 introduced the idea that tokens become vectors - that “King − Man + Woman ≈ Queen” is a real result in vector space. Embeddings are the foundation of RAG retrieval. Here is the complete picture.

An embedding is a list of numbers - typically 768 to 1,536 floating-point values - that encode the meaning of a piece of text. The embedding model is trained on massive amounts of text to learn that texts with similar meaning should produce similar numerical vectors. The numbers themselves have no direct interpretation - only their relationships to each other matter.

Similarity is measured using cosine similarity: how similar are the directions of two vectors in this high-dimensional space? Two identical texts have cosine similarity of 1.0. Unrelated texts have cosine similarity near 0. Semantically related but differently-worded texts score somewhere in between - which is what makes semantic search work.
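
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, with small made-up vectors standing in for real embeddings:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Measures direction, not magnitude.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const identical = cosineSimilarity([1, 2, 3], [1, 2, 3]); // same direction
const scaled = cosineSimilarity([1, 2, 3], [2, 4, 6]);    // same direction, different length
const orthogonal = cosineSimilarity([1, 0], [0, 1]);      // no shared direction
console.log(identical, scaled, orthogonal);
```

Because only direction matters, a vector and a scaled copy of it score the same as two identical vectors.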

Why does this beat keyword search? Because it captures meaning, not just word overlap.

Keyword search

Finds documents containing the exact search terms. Fails when the user asks in different words than the document uses. “Cardiac arrest protocol” would not find a document titled “Heart attack procedures.” “Staff reduction policy” would miss “Employee termination guidelines.”

Semantic / embedding search

Finds documents with similar meaning, regardless of exact words. Searches for “heart attack” and retrieves documents about “cardiac arrest”. Searches for “staff reduction” and finds “employee termination.” Works because both pairs map to nearby vectors in embedding space.

A distance visualisation. Imagine a simplified 2D version of embedding space. A query about “vacation leave policy” produces a vector. In the document store:

Cosine similarity - query vs documents

0.92

HR Policy Doc - Annual Leave and PTO Guidelines

Similarity: Very close - Would be retrieved

0.84

Employee Handbook - Time Off and Absence Management

Similarity: Close - Would be retrieved

0.61

Payroll Policy - Holiday Pay and Overtime

Similarity: Related - Borderline (depends on k)

0.12

IT Security Policy - Password Requirements

Similarity: Unrelated - Would not be retrieved
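
Using the four scores above, here is a sketch of how the choice of k decides what gets retrieved. The optional minimum-score floor is an added assumption, not something the example specifies.

```javascript
// Scores copied from the example above.
const scored = [
  { title: "HR Policy Doc - Annual Leave and PTO Guidelines", score: 0.92 },
  { title: "Employee Handbook - Time Off and Absence Management", score: 0.84 },
  { title: "Payroll Policy - Holiday Pay and Overtime", score: 0.61 },
  { title: "IT Security Policy - Password Requirements", score: 0.12 },
];

// Keep at most k results, dropping anything below an optional minimum score.
function selectChunks(results, k, minScore = 0) {
  return [...results]
    .sort((a, b) => b.score - a.score)
    .filter((r) => r.score >= minScore)
    .slice(0, k);
}

const top2 = selectChunks(scored, 2);           // the two closest documents
const top3 = selectChunks(scored, 3);           // k = 3 pulls in the borderline payroll doc
const floored = selectChunks(scored, 10, 0.5);  // a 0.5 floor excludes the IT policy
console.log(top2.length, top3.length, floored.length);
```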

Embedding quality determines retrieval quality

The embedding model is one of the most important architectural choices in a RAG system - and most teams don't think about it until retrieval starts failing on domain-specific terminology. A general embedding model may not understand your industry's specialised vocabulary well. Test retrieval quality on your own documents before committing to an embedding model.

🗄️

Vector Databases in Practice

8 min

A vector database stores embedding vectors alongside their source text, and enables fast similarity search across millions of entries. When your RAG system asks “what documents are most similar to this query?”, the vector database answers that question - typically in 10–100 milliseconds.

What makes a vector database different from a regular database: regular databases are optimised for exact lookups (“find the row where ID = 4821”). Vector databases are optimised for approximate nearest-neighbour search (“find the 10 vectors most similar to this query vector”). The underlying algorithms - HNSW, IVF, PQ - are specialised for this problem.

The vector database landscape

Pinecone - Fully managed, fast, scalable. Best-in-class latency and developer experience. More expensive at high volume. Good default choice if you want zero infrastructure overhead.

Weaviate - Open-source with managed option. Hybrid search (keyword + vector) built in. Strong for structured data with filters. Good for enterprise knowledge bases where you need both semantic and exact-match queries.

pgvector - Postgres extension. If you're already on Postgres, this is zero infrastructure cost. Lower performance at massive scale. Perfect for teams that want to stay in their existing database without adding a new service.

Chroma - Open-source, excellent for development and prototyping. Not production-ready at enterprise scale but the fastest way to stand up a local RAG system for testing.

Qdrant - Open-source, very fast filtering on metadata, strong on-premise story. Good choice for organisations with data sovereignty requirements or air-gapped infrastructure.

10–100ms

Typical retrieval latency

Managed vector databases return results in milliseconds even across millions of documents. This latency adds to your total response time - budget for it.

1,536

OpenAI Ada-2 vector dimensions

Each chunk becomes a list of 1,536 numbers. Cohere: 1,024. Sentence-transformers: 768. More dimensions can capture finer distinctions, at the cost of more storage.

$0–300/mo

Cost for small-medium workloads

pgvector is free. Chroma is free. Pinecone and Weaviate managed tiers start around $70–$300/month for small workloads. Scales with index size and query volume.

How to choose

Start with pgvector if you're already on Postgres. Start with Chroma for development. Graduate to Pinecone or Weaviate when you need scale, hybrid search, or enterprise SLAs. Never pick a vector database before you know your retrieval volume and whether you need exact-match filtering alongside semantic search.

🔀

Four RAG Patterns for Enterprise

10 min

Not all RAG implementations are equal. There are four patterns, each with different complexity, cost, and retrieval quality. Start simple and graduate as your requirements demand it.

Naive RAG

Ingest documents, chunk, embed, store. At query time: embed query, retrieve top-k chunks, inject into prompt, generate. Simple, fast to implement, works for most straightforward use cases. Fails on: complex multi-hop questions, and long documents where the key information lives at summary level rather than in any single chunk.

Advanced RAG

Adds three improvements: better chunking (recursive or semantic, not just fixed-size), query expansion (rewrite the user's question in multiple ways to improve retrieval recall), and reranking (use a cross-encoder model to re-score the top-k results by true relevance before injection). Significantly better recall. Adds 50–200ms latency and moderate cost.

Hybrid RAG

Combines vector (semantic) search with traditional keyword (BM25) search. Best of both worlds: semantic search finds meaning-similar documents, keyword search handles exact matches for product codes, names, IDs, and technical terms that don't embed well. The current best practice for enterprise knowledge bases with mixed content types.

Agentic RAG

The model itself decides what to retrieve and when. Rather than a single fixed retrieval step before generation, the model can issue multiple retrieval calls mid-response, synthesise across sources, and verify its own answers. Enables complex multi-document reasoning. Requires more complex orchestration. Covered in depth in Session 4.
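
For the Hybrid RAG pattern above, the two ranked lists have to be merged somehow. This session does not specify a method; one commonly used option is reciprocal rank fusion, sketched here with made-up document IDs. The constant `k = 60` is a conventional default, assumed rather than taken from the source.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) to a document's score.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((docId, index) => {
      const rank = index + 1;
      scores.set(docId, (scores.get(docId) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Hypothetical results from a vector search and a keyword (BM25) search.
const vectorResults = ["leave-policy", "absence-handbook", "holiday-pay"];
const keywordResults = ["sku-4821-spec", "leave-policy", "holiday-pay"];
const fused = reciprocalRankFusion([vectorResults, keywordResults]);
console.log(fused);
```

Documents that appear in both lists rise to the top, while an exact-match hit that only keyword search found still survives in the merged ranking.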

The progression rule

Start with Naive RAG. Measure retrieval quality on real queries. Only move to Advanced or Hybrid when you can identify specific failures Naive RAG can't handle. Complexity should be earned by actual retrieval failures - not added pre-emptively. Agentic RAG is for when you need multi-hop reasoning that can't be pre-specified.

💬

Managing Conversation Memory

8 min

Session 1 introduced conversation memory briefly. Now that you understand the full context picture, here are the four production patterns - and when each one is appropriate.

Sliding window

Keep only the last N turns of conversation history. Drop the oldest exchange as each new turn arrives. Simple to implement and effective for task-completion sessions where recent context matters most.

history = history.slice(-N) // keep the last N turns

Risk: preferences the user stated at turn 1 are dropped when turn N+1 arrives.

Conversation summarisation

When history exceeds a threshold, call the LLM to compress older turns into a dense summary. Inject the summary instead of the full history going forward. Reduces token cost by 80–95% while preserving semantic content.

Summarisation prompt: "Summarise the following conversation in 3–5 sentences, preserving key facts, decisions made, and user preferences stated: {old_history}"

Output: 200–400 tokens instead of 3,000+.

Structured memory extraction

After each turn, extract and store key facts as structured data - user name, stated preferences, confirmed decisions. Inject as a compact "User Profile" block at session start instead of raw history. More expensive to maintain, but extremely targeted.

Extracted profile: { name: "Sarah", role: "Procurement Lead", preference: "formal tone", ongoing_issue: "Q2 vendor compliance", last_decision: "extend deadline to Jun 15" }

External memory stores

Store conversation summaries and user facts in a database - not the vector database, but a standard key-value or relational store. Retrieve relevant history on demand when a session starts. Scales to millions of sessions without per-session context cost.

On session start: retrieve user_profile from DB → inject into system prompt. After session ends: extract key facts → write to DB. No full history is ever stored in context.
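
The session-start and session-end flow above can be sketched with a `Map` standing in for the database. Fact extraction itself (usually another model call) is out of scope here, so extracted facts are passed in directly. All names are hypothetical.

```javascript
// In-memory stand-in for a key-value or relational store (hypothetical).
const profileStore = new Map();

// On session start: load the stored profile and inject it into the system prompt.
function startSession(userId, baseSystemPrompt) {
  const profile = profileStore.get(userId);
  if (!profile) return baseSystemPrompt;
  return `${baseSystemPrompt}\n\nUser profile: ${JSON.stringify(profile)}`;
}

// After session ends: merge newly extracted facts into the stored profile.
function endSession(userId, extractedFacts) {
  const existing = profileStore.get(userId) || {};
  profileStore.set(userId, { ...existing, ...extractedFacts });
}

const base = "You are a procurement assistant.";
const firstPrompt = startSession("sarah", base); // no profile yet
endSession("sarah", { name: "Sarah", preference: "formal tone" });
endSession("sarah", { last_decision: "extend deadline to Jun 15" });
const secondPrompt = startSession("sarah", base); // profile now injected
console.log(secondPrompt);
```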

Choosing the right pattern

For short task-completion sessions (under 10 turns): sliding window. For extended work sessions (10–50 turns): summarisation. For long-running relationships - returning users over days or weeks: structured extraction + external store. The pattern should match the session lifetime, not the most impressive architecture.
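
For the summarisation pattern described above, the compaction step can be sketched as follows. The `summarise` function is a stand-in for a real model call and only joins and truncates text; the thresholds are illustrative assumptions.

```javascript
// Stand-in for an LLM summarisation call (hypothetical). A real system would
// send the summarisation prompt to a model; this just joins and truncates.
function summarise(turns) {
  const text = turns.map((turn) => turn.content).join(" ").slice(0, 200);
  return `Summary of ${turns.length} earlier turns: ${text}`;
}

// When history exceeds maxTurns, replace everything except the most recent
// keepRecent turns with a single summary turn.
function compactHistory(history, maxTurns = 10, keepRecent = 4, summariseFn = summarise) {
  if (history.length <= maxTurns) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(-keepRecent);
  return [{ role: "system", content: summariseFn(older) }, ...recent];
}

const history = Array.from({ length: 12 }, (_, i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: `turn ${i + 1}`,
}));
const compacted = compactHistory(history);
console.log(compacted.length, compacted[0].content);
```

Keeping the last few turns verbatim preserves the immediate thread of the conversation, while the summary carries the older facts at a fraction of the token cost.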

🧹

Context Hygiene - What Goes In Must Earn Its Place

6 min

Every token in your context window costs money and competes for the model's attention. Raw documents injected without pre-processing are almost always the wrong approach. A 20-page PDF cleaned down to 4 relevant paragraphs will almost always outperform the raw PDF.

Context hygiene is not optimisation - it is correctness. A model given clean, relevant context consistently outperforms the same model drowning in noise. Preprocessing is the highest-leverage, lowest-cost improvement available in any RAG system.

Fixed-size chunking

Splits at token count boundaries. Simple to implement. Risks cutting sentences mid-thought. Best for: uniform documents with even information density - regulatory text, structured forms, standardised templates.

Recursive chunking

Splits on the largest boundaries first - sections, then paragraphs, then sentences - and only descends to a smaller boundary when a piece is still too large. Respects semantic boundaries and preserves meaning within each chunk. Best for: prose documents, emails, reports, policy documents - anything written in natural language.

Semantic chunking

Uses embedding similarity to detect topic shifts. Splits when meaning changes significantly - not at arbitrary token counts. Best for: long documents covering multiple topics (a 50-page handbook, an annual report, a multi-section policy).

Metadata tagging

Every chunk gets metadata: source document, page number, section title, date, author. Metadata enables filtering ("only search last 30 days"), attribution ("answer based on [document]"), and freshness tracking.
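
A sketch of how chunk metadata enables the filtering and attribution described above. The field names, the sample chunks, and the helpers `filterRecent` and `cite` are all hypothetical; `now` is passed in so the example is deterministic.

```javascript
// Hypothetical chunks with metadata attached at ingest time.
const chunks = [
  { text: "Refunds are processed within 14 business days.", source: "Policy Doc v3.2", page: 4, section: "Refunds", date: "2026-03-01" },
  { text: "Refunds are processed within 30 days.", source: "Policy Doc v2.0", page: 3, section: "Refunds", date: "2024-06-15" },
];

// Keep only chunks dated within the last `days` days before `now`.
function filterRecent(allChunks, days, now) {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return allChunks.filter((chunk) => new Date(chunk.date).getTime() >= cutoff);
}

// Build an attribution string from metadata.
const cite = (chunk) => `${chunk.source}, p. ${chunk.page} (${chunk.section})`;

const recent = filterRecent(chunks, 30, new Date("2026-03-15"));
console.log(recent.map(cite));
```

Without the date field, both refund chunks would look equally relevant to a similarity search, and the outdated one could be retrieved.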

What to strip before ingesting: headers and footers (every page says “CONFIDENTIAL - Page 12 of 47”), repeated boilerplate, table of contents (it duplicates structure without adding content), excessive whitespace, page numbers.

What to keep: the substantive content, section titles (important for context), document title and date as metadata, author or source attribution.
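
The stripping step above can be sketched with a few line-level patterns. The patterns are illustrative and would need tuning per document set; in particular, dropping number-only lines is unsafe for documents with numeric tables.

```javascript
// Patterns for lines that add no content (illustrative; tune for your documents).
const BOILERPLATE_PATTERNS = [
  /^\s*CONFIDENTIAL\s*-\s*Page \d+ of \d+\s*$/i, // repeated header/footer
  /^\s*Page \d+( of \d+)?\s*$/i,                 // bare "Page N" lines
  /^\s*\d+\s*$/,                                 // a line that is only a number
];

function stripBoilerplate(rawText) {
  return rawText
    .split("\n")
    .filter((line) => !BOILERPLATE_PATTERNS.some((pattern) => pattern.test(line)))
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // collapse excessive blank lines
    .trim();
}

const raw =
  "CONFIDENTIAL - Page 12 of 47\nRefund Policy\n\n\n\n" +
  "Refunds apply pro-rata to unused months.\n12\nCONFIDENTIAL - Page 13 of 47";
const cleaned = stripBoilerplate(raw);
console.log(cleaned);
```

The section title "Refund Policy" survives because it matches none of the patterns, which is the behaviour the keep-list above asks for.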

The highest-leverage improvement most teams skip

Pre-processing is the highest-leverage, lowest-cost improvement in any RAG system. Before debugging retrieval quality, debug your preprocessing pipeline.

⚖️

RAG vs Fine-tuning vs Prompting - The Decision Matrix

10 min

These are not alternatives in a hierarchy - they are tools for different problems. The decision of which to use depends on four factors: what kind of knowledge is needed, how frequently it changes, what volume of calls you're running, and what your control requirements are.

Prompting alone

When: task is general, data fits in prompt, volume is low, no proprietary knowledge needed. Cost: prompt and generation tokens only. Freshness: only as fresh as model training. Control: high - you control every word. Best for: personal productivity, general writing, exploration.

RAG

When: large knowledge base, proprietary documents, data changes frequently, citations needed. Cost: embedding at ingest + vector search + injected chunk tokens. Freshness: as fresh as your last document ingestion. Control: high - you choose what to retrieve and what to inject.

Fine-tuning

When: stable domain vocabulary, consistent task patterns, style that cannot be prompted, model needs to internalise a specific behaviour. Cost: training compute + retraining cadence. Freshness: only as fresh as your last training run. Control: medium - you cannot inspect or directly edit weights.

Scenario: Your company releases quarterly pricing updates

RAG / Prompting approach: RAG - index the pricing document. Update it quarterly. Fresh on next query.

Fine-tuning consideration: Model goes stale in 3 months. Retraining costs repeat every quarter.

Scenario: You want the model to always respond in a specific legal-review tone

RAG / Prompting approach: Prompting - well-crafted system prompt + examples handles tone reliably.

Fine-tuning consideration: Valid if tone requirements are very specific and prompting consistently fails.

Scenario: Your team has 5,000 internal policy documents

RAG / Prompting approach: RAG - index all 5,000 documents. Retrieve relevant ones per query. Scalable.

Fine-tuning consideration: Cannot fine-tune 5,000 documents effectively. Data changes anyway.

Scenario: You need a model that formats medical reports in your exact clinical style

RAG / Prompting approach: Prompting + examples - works for most formatting tasks.

Fine-tuning consideration: Valid here. Stable format, specialised vocabulary, consistent patterns.

Scenario: Your product needs to answer questions about this week's product inventory

RAG / Prompting approach: RAG - sync inventory data to knowledge base daily. Always current.

Fine-tuning consideration: Completely wrong here. Data changes daily. Model would be stale immediately.

The most expensive enterprise AI mistake

Fine-tuning a model with company data that changes quarterly. Three months later, the model is stale. You pay for retraining. Meanwhile, a RAG system would have been updated continuously for near-zero incremental cost - just re-ingest the new documents.

🧪

Try It Yourself - Three Exercises

15 min

Each exercise takes 5–10 minutes and demonstrates something you cannot fully grasp from reading alone. No code required for any of them.

Exercise 1 - Simulate RAG yourself · No tools needed

Take 5 paragraphs from any document you work with - a policy, a process doc, an FAQ. Label them 1–5.

For each paragraph, write one sentence summarising what it is about. This simulates what an embedding encodes - the meaning of each chunk.

Write a question someone in your organisation might ask about this document. Write it without looking at the paragraphs.

Now, without re-reading the full text: which paragraph(s) would best answer the question? That selection process is similarity search - done manually. You just retrieved the most relevant chunk.

Count the tokens in those paragraphs using tiktokenizer.vercel.app. That is your injected context cost for this query. Multiply by your daily query volume.

Exercise 2 - Test the memory limit · claude.ai or any LLM

Start a fresh conversation. In your very first message, include: "My name is [your name], I work in [department], my main challenge this quarter is [X]."

Have a conversation for 5–6 turns on a completely different topic - ask for help with something unrelated.

Ask: "What do you remember about me from the beginning of our conversation?" Note how accurately it recalls.

Now start a completely new conversation - fresh tab. Ask the same recall question without providing the context first.

Observe: within a session, context from turn 1 persists. Across sessions, it is completely gone. That is statelessness you can feel directly.

Exercise 3 - Map your use case to a memory pattern · The task from Session 1

Take the repetitive task you identified before this session. Write it at the top of a piece of paper or doc.

Ask: does this task require the AI to know your company's documents, policies, or data? If yes → RAG. Write "RAG" next to it.

Ask: does this task require the AI to remember context across multiple days or sessions - user preferences, ongoing projects, past decisions? If yes → External memory store.

Ask: does this task require the AI to behave in a very specific domain-adapted way that cannot be achieved through prompting? If yes → Fine-tuning candidate.

Write down which memory pattern(s) your task needs. This is your architecture note. Keep it - it becomes your starting point for Session 3.

🏢

Enterprise Memory Design

8 min

Enterprise AI products typically need all three memory tiers simultaneously. Understanding which tier handles which need prevents the common mistake of trying to solve everything with a single RAG index.

Tier 1 - Session

Session Memory - within one conversation

What: conversation history for this specific session, user's stated context, in-progress task state. Pattern: sliding window or summarisation. Duration: one session - closes when tab closes.

Customer support chat · Document Q&A tool · Task completion assistant

↓

Tier 2 - User

User Memory - across sessions for one user

What: user profile, preferences, role, department, past decisions, ongoing projects. Pattern: structured extraction → user profile store → inject on session start. Duration: persistent until user updates.

Personal AI assistant · Internal copilot · Manager productivity tool

↓

Tier 3 - Knowledge

Knowledge Base Memory - shared across all users

What: company documents, policies, product data, process guides, FAQs. Pattern: RAG with vector database. Duration: updated on document ingestion - no model retraining required.

Internal policy search · Product Q&A · Compliance assistant · Onboarding guide
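To see how the three tiers meet in a single request, here is a minimal sketch that assembles one prompt from all of them. The function name, section headings, and ordering are assumptions for illustration - real products differ in how they lay out the prompt.

```python
def build_prompt(question, session_turns, user_profile, knowledge_chunks):
    """Combine the three memory tiers into one prompt string.

    session_turns    - Tier 1: list of strings from the current conversation
    user_profile     - Tier 2: dict of facts loaded at session start
    knowledge_chunks - Tier 3: list of strings retrieved via RAG
    """
    profile_lines = [f"- {key}: {value}" for key, value in user_profile.items()]
    sections = [
        "## Company knowledge (Tier 3)",
        "\n".join(knowledge_chunks),
        "## User profile (Tier 2)",
        "\n".join(profile_lines),
        "## Conversation so far (Tier 1)",
        "\n".join(session_turns),
        "## Current question",
        question,
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    question="Can I expense a taxi home after a late shift?",
    session_turns=["user: I have a question about expenses."],
    user_profile={"role": "analyst", "department": "finance"},
    knowledge_chunks=["Taxis are reimbursable for shifts ending after 9 PM."],
)
print(prompt)
```

Every tier ends up as plain text in the same context window. The tiers differ only in where that text is stored and when it is fetched.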

Design question 1 - Does each session need memory of past sessions?

If yes: build a user memory tier. Extract facts from each session, store them in a user profile database, retrieve and inject on session start. If no: session memory alone is sufficient - simple sliding window or summarisation.

Design question 2 - Does the AI need company-specific knowledge?

If yes: build a knowledge base tier. Index your documents in a vector database. Retrieve relevant chunks at query time via RAG. If no: the model's training data may be sufficient - use prompting alone.

Design question 3 - Does the AI need to recall what happened in this conversation?

Always yes. Every AI product needs session memory - at minimum, the current conversation history. The question is only which management pattern to use: sliding window, summarisation, or structured extraction.
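Of the session-memory patterns named above, the sliding window is the simplest to sketch. The version below keeps the most recent turns that fit a token budget. Counting tokens by whitespace-separated words is an assumption made to keep the example self-contained - a real system would use the model's tokenizer.

```python
def sliding_window(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns whose combined size fits within max_tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


turns = [
    "user: my name is Asha and I work in finance",
    "assistant: noted",
    "user: summarise the travel policy",
    "assistant: here is the summary",
]
print(sliding_window(turns, max_tokens=12))
# The first turn (with the user's name) no longer fits and is dropped.
```

This is the trade-off of a sliding window: it is cheap and predictable, but the oldest facts are the first to be lost. Summarisation or structured extraction is how you keep them.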

💰

The Cost of Memory

8 min

Memory has a token cost. Every technique discussed in this session - RAG, conversation history, user profiles - adds tokens to every API call. At low volume this is trivial. At enterprise scale it can become a major line item. Know the numbers before you design.

~$0.0001

RAG retrieval cost per query

Vector search itself is cheap: ~$0.0001 per query for managed solutions. The cost comes from the tokens you inject after retrieval.

1,200

Tokens added per RAG call (3 chunks × 400 tokens)

A typical RAG retrieval adds 1,200–2,000 tokens to every call. At 1,200 tokens per call, $3/M input tokens and 1,000 queries/day, that's ~$3.60/day - about $108/month. Scale to 10,000/day and it's about $1,080/month.

95%

Token savings: summarised vs full history

Naive full history at turn 10: ~8,000 tokens. Summarised history: ~400 tokens. Summarisation is not just cheaper - in long conversations, it becomes mandatory to avoid truncation.
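The arithmetic behind these cards is worth running yourself. This sketch uses the inputs stated above (1,200 tokens per RAG call, $3 per million input tokens, 8,000 vs 400 history tokens). The function name and the 30-day month are assumptions.

```python
def input_cost_usd(tokens_per_call, calls_per_day, usd_per_million_tokens, days=30):
    """Cost of a fixed number of extra input tokens added to every call."""
    tokens_per_day = tokens_per_call * calls_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days


# RAG overhead: 3 chunks x 400 tokens = 1,200 tokens per call at $3/M input tokens.
daily_1k = input_cost_usd(1_200, 1_000, 3.0, days=1)
monthly_1k = input_cost_usd(1_200, 1_000, 3.0)
monthly_10k = input_cost_usd(1_200, 10_000, 3.0)
print(f"1,000 queries/day:  ${daily_1k:.2f}/day, ${monthly_1k:.2f}/month")
print(f"10,000 queries/day: ${monthly_10k:.2f}/month")

# History overhead: full history vs summarised history at turn 10.
savings = 1 - 400 / 8_000
print(f"Summarisation saves {savings:.0%} of history tokens")
```

Swap in your own token counts, volumes, and prices. The shape of the calculation stays the same: tokens per call × calls per day × price per token.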

Chunk size tuning

Smaller chunks mean more precise retrieval - you inject only the most relevant 200-token paragraph, not a 1,000-token page. Chunks that are too small lose context, so test retrieval accuracy at multiple chunk sizes before committing. For injected context, precision usually beats volume.

Top-k optimisation

Start with top-3 retrieved chunks. Only expand to top-5 or top-10 if retrieval quality genuinely requires it. Each additional chunk adds tokens to every call - multiplied by your daily query volume.

Query caching

Identical queries, and near-identical ones that normalise to the same text, can return cached results without a new embedding call or retrieval step. Customer support scenarios often see 20–40% of queries as repeats. Cache those and the retrieval overhead drops proportionally.

Async embedding at ingest

Batch-embed documents in off-peak hours - not at query time. Embedding documents at query time adds latency and cost to every request; only the user's query should be embedded at that point. The ingestion pipeline should run asynchronously: documents get embedded when they're uploaded, not when they're first queried.
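Query caching is the easiest of these four optimisations to prototype. Below is a minimal sketch of an exact-match cache keyed on a normalised query. The class name and the normalisation rules (lowercase, strip punctuation, collapse whitespace) are assumptions. Catching paraphrases would need a semantic cache, which still embeds the query.

```python
import re


class QueryCache:
    """Exact-match cache keyed on a normalised form of the query."""

    def __init__(self, retrieve):
        self._retrieve = retrieve  # any function: query -> list of chunks
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def normalise(query):
        # Lowercase, drop punctuation, collapse runs of whitespace.
        cleaned = re.sub(r"[^\w\s]", "", query.lower())
        return " ".join(cleaned.split())

    def get(self, query):
        key = self.normalise(query)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._retrieve(query)
        return self._store[key]


def fake_retrieve(query):
    """Stand-in for a real embedding + vector search step."""
    return [f"chunk for: {query}"]


cache = QueryCache(fake_retrieve)
cache.get("What is the refund policy?")
cache.get("what is the   refund policy")  # same key after normalisation
print(cache.hits, cache.misses)  # 1 1
```

This sketch never expires entries. In practice the cache has to be cleared or versioned whenever the knowledge base is re-ingested, otherwise it serves stale chunks.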

⚠️

Five RAG Mistakes and How to Fix Them

8 min

These are the five patterns that consistently produce bad RAG output. Each one has a specific, practical fix.

Mistake 1 - Injecting full documents instead of chunks

The mistake

"Upload 40-page contract. Inject entire document on every query about the contract."

40 pages at 400 tokens/page = 16,000 tokens per call. Most of it irrelevant to any specific question. Buries the relevant clause. Costs are extreme at scale.

The fix

Chunk the document into 400-token sections. Embed each section. At query time: retrieve the 3 most relevant sections (1,200 tokens total). Inject only those. Cost drops 93%. Relevance improves.
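The fix above can be sketched in a few lines. Two assumptions keep it self-contained: chunks are measured in whitespace-separated words rather than real tokens, and word overlap stands in for embedding similarity. A production system would use a tokenizer and a vector database.

```python
def chunk_words(text, chunk_size=400):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def top_k_by_overlap(query, chunks, k=3):
    """Rank chunks by shared query words (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# A synthetic 16,000-word text stands in for the 40-page contract.
document = " ".join(f"word{i}" for i in range(16_000))
chunks = chunk_words(document)
print(len(chunks))  # 40

retrieved = top_k_by_overlap("word5 word401 word801", chunks)
injected = sum(len(chunk.split()) for chunk in retrieved)
print(injected)                          # 1200
print(f"{1 - injected / 16_000:.1%}")    # 92.5%
```

The 92.5% reduction is what the fix rounds to 93%: 1,200 injected tokens instead of 16,000.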

Mistake 2 - Using the wrong chunk size

The mistake

"Chunks of 50 tokens (too small)" or "chunks of 2,000 tokens (too large)"

50-token chunks lose sentence context - retrieved chunks are incomplete thoughts. 2,000-token chunks dilute relevance - too many topics share one embedding, so its similarity to any single query is weakened.

The fix

256–512 tokens is a common starting point for prose documents. Benchmark retrieval accuracy at multiple sizes against 20–30 representative queries. Tune from evidence, not intuition.
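Benchmarking chunk sizes does not need heavy tooling. Here is a minimal sketch of a hit-rate harness: for each labelled query, check whether the expected phrase appears in the top-k retrieved chunks. The tiny document, the two labelled queries, the word-based chunk sizes, and the word-overlap retriever are all illustrative assumptions standing in for your real corpus and vector search.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve(query, chunks, k=3):
    """Rank chunks by shared query words (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(document, labelled_queries, chunk_size, k=3):
    """Fraction of queries whose expected phrase appears in the top-k chunks."""
    chunks = chunk_words(document, chunk_size)
    hits = 0
    for query, expected_phrase in labelled_queries:
        if any(expected_phrase in chunk for chunk in retrieve(query, chunks, k)):
            hits += 1
    return hits / len(labelled_queries)


document = (
    "Annual leave is 25 days per year. "
    "Expense claims must be filed within 30 days. "
    "Remote work requires manager approval."
)
labelled_queries = [
    ("how many days of annual leave", "25 days"),
    ("when must expense claims be filed", "within 30 days"),
]
for size in (4, 8, 1000):
    print(size, hit_rate(document, labelled_queries, size))
```

Hit rate alone will favour very large chunks, because a chunk that contains everything always contains the answer. Read it alongside the tokens injected per call before choosing a size.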

Mistake 3 - Never updating the knowledge base

The mistake

"Index documents once at project launch. Never update."

Six months later: the model is answering questions about outdated policies, old pricing, superseded procedures. No one knows when the data went stale. Trust erodes gradually then suddenly.

The fix

Set an ingestion schedule: daily for high-change content (pricing, inventory), weekly for moderate-change (policies, guides), monthly audit for stable content. Treat the knowledge base like a living document, not a one-time migration.
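An ingestion schedule only helps if something checks it. This sketch flags documents whose last ingestion is older than the interval for their category. The category names, the 30-day stand-in for "monthly", and the document records are assumptions for illustration.

```python
from datetime import date, timedelta

# Refresh intervals per content category, following the schedule above.
REFRESH_INTERVAL = {
    "high_change": timedelta(days=1),      # pricing, inventory
    "moderate_change": timedelta(days=7),  # policies, guides
    "stable": timedelta(days=30),          # monthly audit
}


def overdue_documents(documents, today):
    """Return names of documents last ingested longer ago than their interval."""
    return [
        doc["name"]
        for doc in documents
        if today - doc["last_ingested"] > REFRESH_INTERVAL[doc["category"]]
    ]


documents = [
    {"name": "pricing_sheet", "category": "high_change", "last_ingested": date(2026, 1, 10)},
    {"name": "travel_policy", "category": "moderate_change", "last_ingested": date(2026, 1, 12)},
    {"name": "employee_handbook", "category": "stable", "last_ingested": date(2025, 11, 1)},
]
print(overdue_documents(documents, today=date(2026, 1, 15)))
# ['pricing_sheet', 'employee_handbook']
```

Running a check like this daily answers the question from the mistake above: someone now knows when the data went stale.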

Mistake 4 - Not attributing sources

The mistake

"Model answers from retrieved chunks but provides no indication of which document it used."

When the answer is wrong - and eventually it will be - there is no audit trail. Users cannot verify. Errors are invisible. Compliance requirements may mandate traceability.

The fix

Inject document title + date with every chunk: "[Source: HR Policy Manual v4.2, updated 2026-01-15]". Prompt the model: "Cite the source document and date in your response." Enables verification and builds trust.
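Source tagging is a formatting step at prompt-assembly time. Here is a minimal sketch, assuming each retrieved chunk carries its title and update date as metadata. The function names, dictionary keys, and the "answer using only the sources" instruction are illustrative additions.

```python
def format_chunk(text, title, updated):
    """Prefix a retrieved chunk with its source tag."""
    return f"[Source: {title}, updated {updated}]\n{text}"


def build_cited_prompt(question, chunks):
    """Assemble a prompt where every chunk is tagged and citation is requested."""
    context = "\n\n".join(
        format_chunk(chunk["text"], chunk["title"], chunk["updated"]) for chunk in chunks
    )
    return (
        "Answer using only the sources below. "
        "Cite the source document and date in your response.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


chunks = [
    {
        "text": "Employees accrue annual leave monthly.",  # made-up policy text
        "title": "HR Policy Manual v4.2",
        "updated": "2026-01-15",
    }
]
print(build_cited_prompt("How does annual leave accrue?", chunks))
```

The tag travels with the chunk, so the model can only cite sources that were actually injected. Whether it quotes them correctly still has to be verified.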

Mistake 5 - Using RAG when a database is better

The mistake

"Store customer records, order data, and pricing tables in the vector database. Query with semantic search."

Semantic search on structured data is unreliable. "What is the price for SKU-4821?" should be a SQL query, not a similarity search. Vector search adds latency and produces probabilistic results where you need exact ones.

The fix

RAG is for unstructured text: documents, policies, guides, Q&As. Structured data belongs in SQL or a proper database. Hybrid architectures connect both: the AI can call both a SQL database and a RAG system depending on the query type.
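A hybrid architecture needs a routing step in front of both systems. The sketch below routes on a simple rule: if the query names a SKU, run an exact SQL lookup; otherwise fall through to RAG. The table schema, the price value, the regex rule, and the `rag_search` callable are all hypothetical. Real routers are often a classifier or model tool-calling rather than a regex.

```python
import re
import sqlite3

SKU_PATTERN = re.compile(r"\bSKU-\d+\b", re.IGNORECASE)


def make_price_db():
    """Build a tiny in-memory price table (schema and values are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, price REAL)")
    conn.execute("INSERT INTO prices VALUES ('SKU-4821', 19.99)")
    return conn


def answer(query, conn, rag_search):
    """Route structured lookups to SQL and everything else to RAG."""
    match = SKU_PATTERN.search(query)
    if match:
        # Exact lookup: parameterised query, no similarity search involved.
        row = conn.execute(
            "SELECT price FROM prices WHERE sku = ?", (match.group(0).upper(),)
        ).fetchone()
        return ("sql", row[0] if row else None)
    return ("rag", rag_search(query))


conn = make_price_db()
fake_rag = lambda query: ["Refunds are accepted within 30 days."]  # stand-in retriever

print(answer("What is the price for SKU-4821?", conn, fake_rag))  # ('sql', 19.99)
print(answer("What is our refund policy?", conn, fake_rag)[0])    # rag
```

The SQL path returns either the exact price or nothing. It never returns a "close" SKU, which is the property similarity search cannot guarantee.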

🚧

What Memory Cannot Fix

5 min

RAG is not a reliability fix. It is an information injection mechanism. Understanding this distinction prevents a dangerous overconfidence that frequently leads to production failures.

RAG injects context into the prompt. It does not make the model more accurate within that context. The model is still predicting the next token - it is still capable of misreading, misquoting, or partially interpreting the retrieved chunks. The retrieval can be perfect and the generation can still be wrong.

🔢

RAG does not fix arithmetic errors

If the retrieved document contains numbers, the model can still perform incorrect calculations on those numbers. RAG gets the numbers into context. It does not guarantee the model uses them correctly.

🧠

RAG does not fix complex reasoning failures

Multi-hop reasoning - "given these three documents, what is the implied conclusion?" - is still performed by the same probabilistic token predictor. Retrieval quality and reasoning quality are independent dimensions.

💭

RAG does not fix hallucination

The model can generate confident, fluent text that contradicts or extends beyond the retrieved document. Retrieval increases the chance of relevant information being present. It does not prevent the model from ignoring or misrepresenting that information.

🔍

RAG does not fix bad chunking

If the retrieval step finds the wrong chunks - because chunking split the key information across two paragraphs, or because the query was ambiguous - no amount of downstream generation quality will produce a correct answer.

The most dangerous RAG failure mode

The model gives a confident, fluent, cited answer that is wrong about the cited source. It retrieved the right document, then misread it. RAG removes the “no information” problem. It does not remove the “wrong interpretation” problem. Human verification remains required for high-stakes outputs.

💬

Reflect & Discuss

10 min

Work through these questions yourself, or bring them to the group session. They are designed to connect the mechanics you've just learned to the real use cases you identified before this session.

Q: You identified a repetitive task before this session. What information would a RAG system need to store to make an AI genuinely useful for that task? What are the documents, data sources, and context it would need to retrieve - and how frequently does that information change?
Q: If your company deployed an AI assistant on its internal knowledge base today - policies, processes, product documentation - what would the highest-risk failure mode be? What would a RAG mistake look like in that context, and how would you detect it before it affected a real user?
Q: Session 1 introduced the idea that context management is an engineering discipline. Now that you know about RAG, sliding windows, and summarisation - which pattern applies most directly to how AI tools you already use appear to work? What does that tell you about how those products were built?
Q: The cost of memory compounds. If your team ran 500 AI-assisted tasks per day, each with a 10-turn conversation and one 5-page document injected - estimate the token cost. What three changes to that design would reduce cost by 50% without meaningfully degrading quality?

🧪

This Week's Experiment

Before Session 3

Design a RAG system for your own use case

Take the task you mapped in Session 1. Write out: (1) What documents would you need to index? (2) How would you chunk them - fixed-size, recursive, or semantic? (3) What question would a user typically ask? (4) What 3 chunks would ideally be retrieved to answer it? Draw this on paper or in a document. This becomes your starting point for Session 3, where we go into fine-tuning, prompting strategy, and the full customisation decision matrix.

The goal is not a perfect design - it is to force the architectural questions. Most teams never ask “how would I chunk this?” until they are already in production and retrieval is failing. Asking it now, with a real use case, makes Session 3 immediately applicable.

Coming next Sunday

Session 03 - Fine-tuning vs Prompting vs RAG

May 11 · 7:00 PM · The full customisation decision matrix
