Context, Memory & Why AI Forgets
Session 1 ended by planting a question: how do companies build AI that actually remembers things? This session answers it completely - RAG, vector databases, memory patterns, and the real cost of giving AI a memory.
Where We Left Off
5 minSession 1 ended with a planted question: “If every new chat starts from zero, how do companies build AI systems that actually remember things?” That question is this session's entire agenda - and by the end, you'll have the complete answer.
Here is what Session 1 established: models are stateless, weights are frozen at training, and the context window is the only “memory” a model has for any given request. When you close the tab, everything disappears. The model doesn't carry forward a single word.
The pre-session assignment asked you to map a repetitive task to an AI opportunity. That task is your working example for this entire session - the documents it would need, the context it would require, the memory it would have to persist. Keep it in mind.
Models are stateless next-token predictors. Weights are frozen. Context window is working memory. Chat history is text re-injected into every new call - not recollection.
If the model is stateless and context resets every call, how does any AI product appear to “know” you, retain documents, or have access to company data? That gap is an engineering problem - not a model capability.
The three ways engineers give AI memory. RAG mechanics end-to-end. Vector databases and embeddings. Memory design patterns for enterprise products. The real cost of context injection.
Every enterprise AI product that uses your company's data must solve the memory problem. Understanding how gives you the ability to design solutions - not just use ones others have built.
Why AI Has No Memory
8 minStatelessness is not a bug in LLMs. It is the design. Every API call is completely independent. No knowledge, no context, no history flows between sessions unless you explicitly engineer it to do so.
Recall the frozen weights point from Session 1: at inference time, the model's parameters don't change. Nothing about your conversation affects the underlying model. The weights that existed when you started are identical when you finish.
There's a critical distinction most people miss: “the model doesn't remember” is a model property. “The system doesn't persist” is an engineering choice. You cannot change the first. You absolutely can change the second - and that's what the rest of this session is about.
When you use a product that “remembers” your name, your preferences, or your past questions - that product is injecting stored context into the current prompt. The model isn't recalling anything. A database retrieved your profile, and it was pasted into the context window before your message was processed. What looks like memory is text injection.
“The model forgets” is a model property. “The system doesn't persist” is an engineering choice. You can't change the model. You can change the system.
The Context Window - Your Working Memory
8 minSession 1 introduced the context window. This session goes deeper - because understanding how it fills, what gets dropped, and why position matters is the foundation of every memory engineering decision you'll make.
Think of the context window as working memory for one request. Everything the model knows, for this specific call, must live inside it. What doesn't fit simply does not exist as far as the model is concerned. No warning. No error. Silent truncation.
A million-token context window sounds like it eliminates the memory problem. It doesn't. You still pay per token. You still get primacy/recency bias. You still need to curate what goes in. Bigger windows reduce the urgency of good context management - they don't replace it.
Three Ways to Give AI Memory
10 minThere are exactly three approaches to making an AI system retain information across a session, across sessions, or across all users. They differ in cost, complexity, freshness, and what kind of knowledge they handle well. Understanding all three - and when each is right - is the core design skill this session builds.
Most enterprise teams reach for fine-tuning first and RAG second. This is backwards. RAG is faster to deploy, cheaper to run, keeps data fresh, and can be audited - you can see exactly what was retrieved and why. Fine-tune only when RAG genuinely can't solve it.
The Knowledge Cutoff Problem - and Its Solution
6 minSession 1 introduced the knowledge cutoff. Now that you understand the three memory approaches, you can see clearly why it matters and how to solve it.
The cutoff means the model genuinely does not know about anything that happened after its training data was collected. But the more practical problem for most enterprises isn't news events - it's proprietary data. Your internal documents, last month's pricing, updated policies, customer records: the model has never seen any of it.
The naive solution - paste the document into the prompt - works for one document, one question. It fails immediately when you have hundreds of documents, when the right document isn't obvious, or when the document is longer than available context. RAG is the scalable solution.
Your internal documents and policies. Your current pricing. Post-cutoff news and regulations. Anything proprietary or non-public. Internal processes and institutional knowledge. Client histories and account data.
Current knowledge - as fresh as your last document ingestion. Specific documents retrieved on demand. Proprietary data without retraining the model. Real-time injected context with source attribution.
If the information is proprietary, recent, or specific to your organisation, it must be injected. Don't rely on the model's training data for anything your business depends on. Treat the model as a reasoning engine, not a source of truth.
RAG: How Retrieval Actually Works
12 minRAG - Retrieval Augmented Generation - is the standard enterprise architecture for giving AI access to proprietary or current information without retraining. Here is the complete pipeline, step by step.
WITHOUT RAG - prompt only contains training knowledge
System: You are a policy assistant for Acme Corp.
User: What is the refund policy for enterprise contracts?
→ Model answers from training data. Generic. Possibly wrong.WITH RAG - retrieved chunk injected into context
System: You are a policy assistant for Acme Corp.
Context [from Policy Doc v3.2, updated 2026-03-01]:
"Enterprise contract refunds are processed within 14 business
days of written cancellation request. Refunds apply pro-rata
to unused months. Setup fees are non-refundable..."
User: What is the refund policy for enterprise contracts?
→ Model answers from the actual, current policy document.RAG doesn't make the model smarter. It makes the model's context richer. The model is still doing the same thing - predicting the best next token. RAG just ensures the most relevant information is in the window when it does. The intelligence is in the retrieval; the generation is the same as always.
Embeddings and Similarity Search
8 minSession 1 introduced the idea that tokens become vectors - that “King − Man + Woman ≈ Queen” is a real result in vector space. Embeddings are the foundation of RAG retrieval. Here is the complete picture.
An embedding is a list of numbers - typically 768 to 1,536 floating-point values - that encode the meaning of a piece of text. The embedding model is trained on massive amounts of text to learn that texts with similar meaning should produce similar numerical vectors. The numbers themselves have no direct interpretation - only their relationships to each other matter.
Similarity is measured using cosine similarity: how similar are the directions of two vectors in this high-dimensional space? Two identical texts have cosine similarity of 1.0. Unrelated texts have cosine similarity near 0. Semantically related but differently-worded texts score somewhere in between - which is what makes semantic search work.
Why does this beat keyword search? Because it captures meaning, not just word overlap.
Finds documents containing the exact search terms. Fails when the user asks in different words than the document uses. “Cardiac arrest protocol” would not find a document titled “Heart attack procedures.” “Staff reduction policy” would miss “Employee termination guidelines.”
Finds documents with similar meaning, regardless of exact words. Searches for “heart attack” and retrieves documents about “cardiac arrest”. Searches for “staff reduction” and finds “employee termination.” Works because both pairs map to nearby vectors in embedding space.
A distance visualisation. Imagine a simplified 2D version of embedding space. A query about “vacation leave policy” produces a vector. In the document store:
HR Policy Doc - Annual Leave and PTO Guidelines
Similarity: Very close - Would be retrieved
Employee Handbook - Time Off and Absence Management
Similarity: Close - Would be retrieved
Payroll Policy - Holiday Pay and Overtime
Similarity: Related - Borderline (depends on k)
IT Security Policy - Password Requirements
Similarity: Unrelated - Would not be retrieved
The embedding model is one of the most important architectural choices in a RAG system - and most teams don't think about it until retrieval starts failing on domain-specific terminology. A general embedding model may not understand your industry's specialised vocabulary well. Test retrieval quality on your own documents before committing to an embedding model.
Vector Databases in Practice
8 minA vector database stores embedding vectors alongside their source text, and enables fast similarity search across millions of entries. When your RAG system asks “what documents are most similar to this query?”, the vector database answers that question - typically in 10–100 milliseconds.
What makes a vector database different from a regular database: regular databases are optimised for exact lookups (“find the row where ID = 4821”). Vector databases are optimised for approximate nearest-neighbour search (“find the 10 vectors most similar to this query vector”). The underlying algorithms - HNSW, IVF, PQ - are specialised for this problem.
Pinecone - Fully managed, fast, scalable. Best-in-class latency and developer experience. More expensive at high volume. Good default choice if you want zero infrastructure overhead.
Weaviate - Open-source with managed option. Hybrid search (keyword + vector) built in. Strong for structured data with filters. Good for enterprise knowledge bases where you need both semantic and exact-match queries.
pgvector - Postgres extension. If you're already on Postgres, this is zero infrastructure cost. Lower performance at massive scale. Perfect for teams that want to stay in their existing database without adding a new service.
Chroma - Open-source, excellent for development and prototyping. Not production-ready at enterprise scale but the fastest way to stand up a local RAG system for testing.
Qdrant - Open-source, very fast filtering on metadata, strong on-premise story. Good choice for organisations with data sovereignty requirements or air-gapped infrastructure.
Start with pgvector if you're already on Postgres. Start with Chroma for development. Graduate to Pinecone or Weaviate when you need scale, hybrid search, or enterprise SLAs. Never pick a vector database before you know your retrieval volume and whether you need exact-match filtering alongside semantic search.
Four RAG Patterns for Enterprise
10 minNot all RAG implementations are equal. There are four patterns, each with different complexity, cost, and retrieval quality. Start simple and graduate as your requirements demand it.
Start with Naive RAG. Measure retrieval quality on real queries. Only move to Advanced or Hybrid when you can identify specific failures Naive RAG can't handle. Complexity should be earned by actual retrieval failures - not added pre-emptively. Agentic RAG is for when you need multi-hop reasoning that can't be pre-specified.
Managing Conversation Memory
8 minSession 1 introduced conversation memory briefly. Now that you understand the full context picture, here are the four production patterns - and when each one is appropriate.
For short task-completion sessions (under 10 turns): sliding window. For extended work sessions (10–50 turns): summarisation. For long-running relationships - returning users over days or weeks: structured extraction + external store. The pattern should match the session lifetime, not the most impressive architecture.
Context Hygiene - What Goes In Must Earn Its Place
6 minEvery token in your context window costs money and competes for the model's attention. Raw documents injected without pre-processing are almost always the wrong approach. A 20-page PDF cleaned down to 4 relevant paragraphs will outperform the raw PDF every time.
Context hygiene is not optimisation - it is correctness. A model given clean, relevant context consistently outperforms the same model drowning in noise. Preprocessing is the highest-leverage, lowest-cost improvement available in any RAG system.
Splits at token count boundaries. Simple to implement. Risks cutting sentences mid-thought. Best for: uniform documents with even information density - regulatory text, structured forms, standardised templates.
Splits by sentences first, then paragraphs, then sections - respecting semantic boundaries. Preserves meaning within each chunk. Best for: prose documents, emails, reports, policy documents - anything written in natural language.
Uses embedding similarity to detect topic shifts. Splits when meaning changes significantly - not at arbitrary token counts. Best for: long documents covering multiple topics (a 50-page handbook, an annual report, a multi-section policy).
Every chunk gets metadata: source document, page number, section title, date, author. Metadata enables filtering ("only search last 30 days"), attribution ("answer based on [document]"), and freshness tracking.
What to strip before ingesting: headers and footers (every page says “CONFIDENTIAL - Page 12 of 47”), repeated boilerplate, table of contents (it duplicates structure without adding content), excessive whitespace, page numbers.
What to keep: the substantive content, section titles (important for context), document title and date as metadata, author or source attribution.
Pre-processing is the highest-leverage, lowest-cost improvement in any RAG system. A 20-page PDF cleaned down to 4 relevant paragraphs outperforms the raw PDF every time. Before debugging retrieval quality, debug your preprocessing pipeline.
RAG vs Fine-tuning vs Prompting - The Decision Matrix
10 minThese are not alternatives in a hierarchy - they are tools for different problems. The decision of which to use depends on four factors: what kind of knowledge is needed, how frequently it changes, what volume of calls you're running, and what your control requirements are.
When: task is general, data fits in prompt, volume is low, no proprietary knowledge needed. Cost: generation tokens only. Freshness: only as fresh as model training. Control: high - you control every word. Best for: personal productivity, general writing, exploration.
When: large knowledge base, proprietary documents, data changes frequently, citations needed. Cost: embedding at ingest + vector search + injected chunk tokens. Freshness: as fresh as your last document ingestion. Control: high - you choose what to retrieve and what to inject.
When: stable domain vocabulary, consistent task patterns, style that cannot be prompted, model needs to internalise a specific behaviour. Cost: training compute + retraining cadence. Freshness: only as fresh as your last training run. Control: medium - you cannot inspect or directly edit weights.
RAG - index the pricing document. Update it quarterly. Fresh on next query.Fine-tuning - model goes stale in 3 months. Retraining costs repeat every quarter.Prompting - well-crafted system prompt + examples handles tone reliably.Fine-tuning - valid if tone requirements are very specific and prompting consistently fails.RAG - index all 5,000 documents. Retrieve relevant ones per query. Scalable.Fine-tuning - cannot fine-tune 5,000 documents effectively. Data changes anyway.Prompting + examples - works for most formatting tasks.Fine-tuning - valid here. Stable format, specialised vocabulary, consistent patterns.RAG - sync inventory data to knowledge base daily. Always current.Fine-tuning - completely wrong here. Data changes daily. Model would be stale immediately.Fine-tuning a model with company data that changes quarterly. Three months later, the model is stale. You pay for retraining. Meanwhile, a RAG system would have been updated continuously for near-zero incremental cost - just re-ingest the new documents.
Try It Yourself - Three Exercises
15 minEach exercise takes 5–10 minutes and demonstrates something you cannot fully grasp from reading alone. No code required for any of them.
Take 5 paragraphs from any document you work with - a policy, a process doc, an FAQ. Label them 1–5.
For each paragraph, write one sentence summarising what it is about. This simulates what an embedding encodes - the meaning of each chunk.
Write a question someone in your organisation might ask about this document. Write it without looking at the paragraphs.
Now, without re-reading the full text: which paragraph(s) would best answer the question? That selection process is similarity search - done manually. You just retrieved the most relevant chunk.
Count the tokens in those paragraphs using tiktokenizer.vercel.app. That is your injected context cost for this query. Multiply by your daily query volume.
Start a fresh conversation. In your very first message, include: "My name is [your name], I work in [department], my main challenge this quarter is [X]."
Have a conversation for 5–6 turns on a completely different topic - ask for help with something unrelated.
Ask: "What do you remember about me from the beginning of our conversation?" Note how accurately it recalls.
Now start a completely new conversation - fresh tab. Ask the same recall question without providing the context first.
Observe: within a session, context from turn 1 persists. Across sessions, it is completely gone. That is statelessness you can feel directly.
Take the repetitive task you identified before this session. Write it at the top of a piece of paper or doc.
Ask: does this task require the AI to know your company's documents, policies, or data? If yes → RAG. Write "RAG" next to it.
Ask: does this task require the AI to remember context across multiple days or sessions - user preferences, ongoing projects, past decisions? If yes → External memory store.
Ask: does this task require the AI to behave in a very specific domain-adapted way that cannot be achieved through prompting? If yes → Fine-tuning candidate.
Write down which memory pattern(s) your task needs. This is your architecture note. Keep it - it becomes your starting point for Session 3.
Enterprise Memory Design
8 minEnterprise AI products typically need all three memory tiers simultaneously. Understanding which tier handles which need prevents the common mistake of trying to solve everything with a single RAG index.
The Cost of Memory
8 minMemory has a token cost. Every technique discussed in this session - RAG, conversation history, user profiles - adds tokens to every API call. At low volume this is trivial. At enterprise scale it becomes the dominant line item. Know the numbers before you design.
Five RAG Mistakes and How to Fix Them
8 minThese are the five patterns that consistently produce bad RAG output. Each one has a specific, practical fix.
"Upload 40-page contract. Inject entire document on every query about the contract."Chunk the document into 400-token sections. Embed each section. At query time: retrieve the 3 most relevant sections (1,200 tokens total). Inject only those. Cost drops 93%. Relevance improves."Chunks of 50 tokens (too small)" or "chunks of 2,000 tokens (too large)"256–512 tokens is the tested starting point for prose documents. Benchmark retrieval accuracy at multiple sizes against 20–30 representative queries. Tune from evidence, not intuition."Index documents once at project launch. Never update."Set an ingestion schedule: daily for high-change content (pricing, inventory), weekly for moderate-change (policies, guides), monthly audit for stable content. Treat the knowledge base like a living document, not a one-time migration."Model answers from retrieved chunks but provides no indication of which document it used."Inject document title + date with every chunk: "[Source: HR Policy Manual v4.2, updated 2026-01-15]". Prompt the model: "Cite the source document and date in your response." Enables verification and builds trust."Store customer records, order data, and pricing tables in the vector database. Query with semantic search."RAG is for unstructured text: documents, policies, guides, Q&As. Structured data belongs in SQL or a proper database. Hybrid architectures connect both: the AI can call both a SQL database and a RAG system depending on the query type.What Memory Cannot Fix
5 minRAG is not a reliability fix. It is an information injection mechanism. Understanding this distinction prevents a dangerous overconfidence that frequently leads to production failures.
RAG injects context into the prompt. It does not make the model more accurate within that context. The model is still predicting the next token - it is still capable of misreading, misquoting, or partially interpreting the retrieved chunks. The retrieval can be perfect and the generation can still be wrong.
The model gives a confident, fluent, cited answer that is wrong about the cited source. It retrieved the right document, then misread it. RAG removes the “no information” problem. It does not remove the “wrong interpretation” problem. Human verification remains required for high-stakes outputs.
Reflect & Discuss
10 minWork through these questions yourself, or bring them to the group session. They are designed to connect the mechanics you've just learned to the real use cases you identified before this session.
- You identified a repetitive task before this session. What information would a RAG system need to store to make an AI genuinely useful for that task? What are the documents, data sources, and context it would need to retrieve - and how frequently does that information change?
- If your company deployed an AI assistant on its internal knowledge base today - policies, processes, product documentation - what would the highest-risk failure mode be? What would a RAG mistake look like in that context, and how would you detect it before it affected a real user?
- Session 1 introduced the idea that context management is an engineering discipline. Now that you know about RAG, sliding windows, and summarisation - which pattern applies most directly to how AI tools you already use appear to work? What does that tell you about how those products were built?
- The cost of memory compounds. If your team ran 500 AI-assisted tasks per day, each with a 10-turn conversation and one 5-page document injected - estimate the token cost. What three changes to that design would reduce cost by 50% without meaningfully degrading quality?
This Week's Experiment
The goal is not a perfect design - it is to force the architectural questions. Most teams never ask “how would I chunk this?” until they are already in production and retrieval is failing. Asking it now, with a real use case, makes Session 3 immediately applicable.