Context Windows Are Not Memory

AUTHOR: M. ABDELNABY DATE: 2026-04-08 CATEGORY: Mental Models MODE: READONLY
SYNOPSIS

Engineers building on top of LLMs routinely mistake the context window for a memory system. It is not. It is a fixed-size input buffer that is reconstructed from scratch on every inference call. The mental model you use determines whether your RAG pipeline retrieves the right chunks — or confidently hallucinates from the wrong ones.

DESCRIPTION

When you call an LLM API, there is no persistent state. The model has no memory of your previous requests. The context window — the token buffer containing the conversation, retrieved documents, and system instructions — is assembled by your application before every single call, sent to the model, and discarded afterward.

This is not a limitation to work around. It is the architecture. Engineers who internalize it build RAG pipelines that are fast, debuggable, and correct. Engineers who do not internalize it build pipelines that work in demos and hallucinate in production.

THE WRONG MENTAL MODEL

User: "What was the budget we discussed last week?"

# The engineer assumes:
LLM → has access to last week's conversation
LLM → remembers what "we" discussed
LLM → can retrieve "the budget" from some internal store

None of this is true. The model received a token sequence that started with "What was the budget we discussed last week?" with zero prior context. It will either refuse to answer, make something up, or ask for clarification — depending on its instruction tuning.

The application is responsible for every piece of information the model has access to. If it is not in the context window, the model does not know it exists.

THE CORRECT MODEL: CONTEXT IS AN INPUT, NOT A SESSION

Every inference call =
    system_prompt
    + retrieved_documents (from your vector store)
    + conversation_history (from your database)
    + current_user_message
    ────────────────────────────────────────────
    → token sequence → model → response token sequence

Your application assembles this input. It sends it. It gets a response. It stores the response in its own database if it needs to use it later. The model is stateless. All state management is your responsibility.

def call_llm(user_message: str, session_id: str) -> str:
    # You retrieve history — the model does not have it
    history = conversation_store.get(session_id, last_n=10)

    # You retrieve relevant documents — the model cannot search
    chunks = vector_store.search(user_message, top_k=5)

    context = build_context(
        system_prompt=SYSTEM_PROMPT,
        documents=chunks,
        history=history,
        message=user_message
    )

    response = llm_client.complete(context)

    # You persist the exchange — the model forgets immediately
    conversation_store.append(session_id, user_message, response)

    return response
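
The build_context helper above is left undefined. A minimal sketch of what it might do, assuming the client accepts a list of role-tagged messages and that documents and history expose text, user_message, and response fields (all names here are illustrative, not a specific client's API):

def build_context(system_prompt, documents, history, message):
    # Retrieved documents go into the system message so the model treats
    # them as reference material rather than as user turns.
    doc_block = "\n\n".join(f"[doc {i + 1}]\n{d.text}" for i, d in enumerate(documents))
    messages = [{"role": "system",
                 "content": f"{system_prompt}\n\nReference documents:\n{doc_block}"}]
    # Prior turns come from your own store; the model has never seen them.
    for turn in history:
        messages.append({"role": "user", "content": turn.user_message})
        messages.append({"role": "assistant", "content": turn.response})
    messages.append({"role": "user", "content": message})
    return messages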

The model is a function. Input tokens in, output tokens out. Your application is the memory system.

THE RETRIEVAL PROBLEM

RAG (Retrieval-Augmented Generation) pipelines fail in a specific, repeatable way: the retrieval step surfaces chunks that are semantically adjacent to the query but contextually wrong for the answer.

User query: "What are our SLA commitments for enterprise customers?"

Vector search returns:
  chunk_1: "...SLA definitions and general terms..." (similarity: 0.91)
  chunk_2: "...enterprise pricing tiers..."          (similarity: 0.88)
  chunk_3: "...SLA for standard tier customers..."   (similarity: 0.87)

Missing:
  chunk_4: "...enterprise SLA: 99.95% uptime..."    (similarity: 0.71)
  ← this is the correct answer, ranked 8th

Similarity is not relevance. A chunk about SLA definitions is semantically close to a query about SLA commitments but may not contain the answer. Embedding distance measures topical proximity, not information value.

Fix this with hybrid search — combine dense vector retrieval with sparse keyword matching (BM25). The keyword match will surface the chunk that contains "enterprise SLA: 99.95%" even if its embedding is not the closest.

def hybrid_search(query: str, top_k: int = 5):
    dense_results = vector_store.search(query, top_k=top_k * 2)
    sparse_results = bm25_index.search(query, top_k=top_k * 2)

    # Reciprocal rank fusion: combine rankings, not scores
    return reciprocal_rank_fusion(dense_results, sparse_results, top_k=top_k)
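
The fusion step referenced above can be implemented without normalizing scores at all. A sketch of reciprocal rank fusion, assuming each result object exposes a stable id attribute; k=60 is the conventional RRF constant:

def reciprocal_rank_fusion(dense_results, sparse_results, top_k=5, k=60):
    # Combine rank positions, not raw scores: cosine similarity and BM25
    # scores live on different scales and cannot be summed directly.
    scores, docs = {}, {}
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked[:top_k]]

A document near the top of both rankings accumulates the highest fused score; a document that only one retriever surfaces can still win if it ranks high enough there.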

CONTEXT WINDOW AS A BUDGET

Every token in the context window is a cost and a tradeoff. A 128k context window does not mean you should fill it.

128k token window breakdown (approximate):
──────────────────────────────────────────
System prompt:           ~500 tokens
Conversation history:  ~2,000 tokens
Retrieved chunks:     ~10,000 tokens
User message:            ~100 tokens
────────────────────────────────────
Remaining for response: ~115,000 tokens (unused)

Costs scale with context length. Latency scales with context length. And critically — model attention is not uniform across the window. Research consistently shows that information at the very beginning and very end of a long context gets more reliable attention than information buried in the middle. Stuffing 50 chunks into the context does not mean the model will use all 50. It means the 35 in the middle are probably ignored.

Retrieve fewer, better chunks. Rank them by relevance. Put the most critical context closest to the query. Treat context window space as a constrained resource.
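
One way to enforce that budget, sketched under the assumption that chunks arrive ranked best-first and that count_tokens wraps whatever tokenizer your model uses (both names are illustrative):

def pack_chunks(chunks, budget_tokens=4000):
    # Take ranked chunks until the token budget is spent, then reverse so
    # the best chunk sits last, closest to the user message, where
    # attention is most reliable.
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk.text)
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return list(reversed(selected))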

HALLUCINATION IS A RETRIEVAL FAILURE

The most common source of hallucination in RAG systems is not the model making things up from nothing. It is the model being given insufficient or misleading retrieved context and extrapolating.

Query: "What is our refund policy for enterprise contracts?"

Retrieved chunks: none that contain refund policy for enterprise

Model has two options:
  1. Say "I don't have that information"
  2. Extrapolate from standard tier refund policy in context

Without explicit "I don't know" instruction:
  Model chooses option 2 — confidently, incorrectly

Fix this at the prompt level, not the model level:

SYSTEM: You are a support assistant. Answer only from the provided documents.
        If the documents do not contain the answer, respond exactly:
        "I don't have that information in the provided context."
        Do not infer, extrapolate, or use general knowledge.

And verify at the application level: if the retrieval step returns zero relevant chunks above a confidence threshold, do not call the model. Return a "no information found" response directly. Calling the model with empty context and asking it to answer is asking it to hallucinate.
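
A sketch of that gate, reusing the illustrative helpers from above; the 0.80 threshold and the similarity attribute are assumptions to be calibrated against your own retrieval evaluation set:

NO_ANSWER = "I don't have that information in the provided context."
MIN_SIMILARITY = 0.80  # assumed threshold; tune on your own data

def answer(user_message: str, session_id: str) -> str:
    chunks = vector_store.search(user_message, top_k=5)
    relevant = [c for c in chunks if c.similarity >= MIN_SIMILARITY]
    if not relevant:
        # Nothing worth showing the model: answer directly instead of
        # asking it to extrapolate from irrelevant context.
        return NO_ANSWER
    history = conversation_store.get(session_id, last_n=10)
    context = build_context(SYSTEM_PROMPT, relevant, history, user_message)
    return llm_client.complete(context)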

THE MENTAL MODEL

The context window is an input buffer. It is assembled by your application, sent on every call, and discarded. The model is stateless. Your application owns memory, retrieval, history, and context construction. Hallucination is most often a failure of what you put in the window, not a failure of the model itself. If the answer is not in the context, do not expect the model to find it — expect it to invent something plausible.

Design the retrieval pipeline before you design the prompt. The prompt is the last 10%. The retrieval is the system.

SEE ALSO

vector-search(4), hybrid-retrieval(6), embeddings-are-lossy(3), token-budgeting(5)
