Why LLMs Forget: The Context Window Problem No One Is Solving

Every conversation with an LLM starts from zero. No matter how deeply you discussed your project architecture yesterday, no matter how carefully you explained your dietary restrictions last week, the model greets you today with the same blank slate. It has no idea who you are.

The industry's answer so far has been to make the window bigger. GPT-4 launched with an 8K-token window and a 32K variant, and GPT-4 Turbo extended that to 128K. Claude reached 200K. Gemini 1.5 hit a million. The implicit promise: give the model enough room and forgetting goes away.

It does not.

Bigger Windows, Same Problem

A larger context window means the model can hold more text at once. That is not the same as remembering what matters. Consider what happens when you stuff a million tokens into a prompt:

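  • Nothing marks what matters. Attention does weight tokens, but those weights come from patterns in the text, not from any record of what is important to this user. The model has no explicit mechanism to say "this fact is critical" versus "this is background noise." A casual mention of your favorite coffee enters the prompt with the same standing as a deeply personal disclosure from three months ago.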
  • Signal drowns in noise. Long-context evaluations have repeatedly found that LLMs struggle to retrieve specific facts from long prompts. The well-known "needle in a haystack" tests show that accuracy tends to degrade as context length grows, especially for information buried in the middle.
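  • Latency and cost scale at least linearly. With per-token pricing, doubling the prompt roughly doubles what you pay for it, and the compute for standard self-attention grows quadratically with sequence length. Filling a million-token window for every query is economically unsustainable for most production applications.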
  • It still resets. No matter how large the window, it empties between sessions. You cannot carry a million tokens from Tuesday's conversation into Wednesday's.

Bigger windows are a brute-force approach to a problem that requires intelligence. The bottleneck was never the size of the window. It is the absence of any system to decide what belongs inside it.

The Approaches That Fall Short

Sliding Window Truncation

The simplest strategy: keep the most recent N messages and drop everything else. This is what most chatbot applications do by default. It works tolerably for short, single-topic conversations. It fails catastrophically for anything with history.

If a user mentioned a severe food allergy in message 3 and the conversation is now on message 50, sliding window truncation has silently discarded a safety-critical fact. The model does not know it forgot. It will confidently recommend recipes containing the allergen.
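The allergy scenario above can be reproduced in a few lines. This is a minimal sketch, assuming the history is a plain list of strings; the function name `sliding_window` and the sample messages are illustrative, not taken from any particular chatbot framework.

```python
def sliding_window(messages, max_messages):
    """Keep only the most recent max_messages entries and drop the rest."""
    # Guard against max_messages == 0: messages[-0:] would return everything.
    return messages[-max_messages:] if max_messages > 0 else []


# Hypothetical 50-message history with a safety-critical fact in message 3.
history = [f"message {i}" for i in range(1, 51)]
history[2] = "message 3: I have a severe peanut allergy"

context = sliding_window(history, max_messages=10)

# The allergy is gone, and nothing in the retained context signals the loss.
allergy_survived = any("allergy" in m for m in context)
print(len(context), allergy_survived)
```

The truncation step never inspects message content, so it has no way to notice that it discarded something important.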

Conversation Summarization

A more sophisticated approach: periodically summarize older messages and inject the summary into context. This preserves some history while saving tokens. But summarization introduces two problems.

First, lossy compression is irreversible. Once a detail is summarized away, it cannot be recovered. The summarizer must decide what matters, and it does so without any scoring framework — it guesses, often poorly. Nuance, emotional context, and specificity are the first casualties.

Second, summaries compound. A summary of a summary of a summary rapidly degrades into vague platitudes: "The user discussed various personal topics." That sentence contains zero actionable information, yet it consumes tokens.

Retrieval-Augmented Generation (RAG)

RAG addresses a different problem: connecting an LLM to external knowledge. A user asks a question, the system searches a document store (usually via embeddings), and injects the top-K most similar chunks into the prompt. RAG is excellent for knowledge bases, documentation, and factual grounding.

But RAG is not memory. It retrieves by similarity, not by importance. It cannot distinguish between something the user casually mentioned once and something they have revisited across twenty conversations. It has no concept of decay, no notion that recent information might outweigh old information of equal relevance, no ability to suppress a topic the user has asked to stop discussing.

RAG answers the question "what text is most similar to this query?" Memory answers the question "what does this system need to know about this person right now?"
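To make the distinction concrete, here is a minimal sketch of top-K retrieval by cosine similarity. The three-dimensional vectors stand in for real embeddings, and the `mentions` field is a hypothetical piece of metadata added only to show that a pure similarity ranking never consults it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunks, k):
    """Rank chunks by similarity to the query and return the best k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]


# Toy vectors (assumed, not produced by a real embedding model).
chunks = [
    {"text": "Had a salad for lunch", "vec": [0.9, 0.1, 0.0], "mentions": 1},
    {"text": "Daughter's name is Ellie", "vec": [0.1, 0.9, 0.1], "mentions": 20},
    {"text": "Prefers oat milk in coffee", "vec": [0.8, 0.2, 0.1], "mentions": 1},
]
query = [1.0, 0.0, 0.0]  # stand-in embedding for a food-related question

results = top_k(query, chunks, k=2)
for chunk in results:
    print(chunk["text"], chunk["mentions"])
```

The fact mentioned twenty times is left out because its vector is far from the query. Mention count, recency, and user preferences play no part in the ranking.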

The Real Problem: Prioritization

Human memory does not work by storing everything in a giant buffer. It works through a sophisticated system of salience scoring — unconsciously assigning importance to experiences based on dozens of factors: emotional intensity, repetition, recency, novelty, personal relevance, and more.

When you walk into a room, you do not consciously process every object. Your brain surfaces what matters: the person waving at you, the exit sign, the smell of smoke. Everything else fades into background. This is salience in action.

LLMs have no equivalent mechanism. Nothing in the context window is tagged with how much it matters to the user. The model cannot distinguish between "my daughter's name is Ellie" (mentioned twenty times across six months) and "I had a salad for lunch" (mentioned once, three months ago). Both are just tokens.

The context window is not a memory system. It is a scratchpad. Memory requires scoring, decay, and retrieval — none of which a context window provides.

What Memory Actually Requires

To give an LLM genuine memory — the kind that makes a user feel known and understood — you need four capabilities that no context window provides:

  1. Scoring. A way to quantify the importance of each piece of information using multiple signals: how often it has been referenced, how much it diverges from baseline, how central it is to the user's identity, how recently it was relevant.
  2. Decay. Memories should fade over time — but not uniformly. Information that has been repeatedly accessed should decay more slowly, just as heavily rehearsed memories persist longer in human cognition. A memory mentioned once should gradually lose salience. A memory revisited every week should remain prominent.
  3. Selective retrieval. At query time, the system must choose which memories to inject into the limited prompt space. This is not a similarity search. It is a multi-factor ranking that considers salience, recency, relevance to the current query, and safety constraints.
  4. Safety. Memory creates risk. A system that remembers a user's trauma history must handle that information with extraordinary care — never surfacing triggers carelessly, never fabricating memories that did not occur, never storing information that should not persist (like credit card numbers spoken aloud).
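The first three capabilities, plus a very crude safety gate, can be sketched together. Everything here is an assumption made for illustration: the `Memory` fields, the log-of-mentions frequency term, the 30-day base half-life, the way relevance is blended in, and the boolean `sensitive` flag. It is one possible shape for such a system, not a description of any specific product.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    mentions: int             # how often the fact has been referenced
    days_since_access: float  # time since it was last relevant
    relevance: float          # 0..1 match to the current query (assumed precomputed)
    sensitive: bool = False   # hypothetical safety flag


def salience(m, base_half_life_days=30.0):
    """Frequency-weighted score with exponential decay."""
    # Rehearsed memories decay more slowly: half-life grows with mention count.
    half_life = base_half_life_days * (1.0 + math.log1p(m.mentions))
    recency = 0.5 ** (m.days_since_access / half_life)
    return math.log1p(m.mentions) * recency


def select(memories, budget, allow_sensitive=False):
    """Multi-factor ranking: filter by safety, then rank by salience and relevance."""
    allowed = [m for m in memories if allow_sensitive or not m.sensitive]
    ranked = sorted(allowed, key=lambda m: salience(m) * (0.5 + m.relevance), reverse=True)
    return ranked[:budget]


memories = [
    Memory("Daughter's name is Ellie", mentions=20, days_since_access=7, relevance=0.3),
    Memory("Had a salad for lunch", mentions=1, days_since_access=90, relevance=0.6),
    Memory("Disclosed a past trauma", mentions=5, days_since_access=10, relevance=0.4, sensitive=True),
]

chosen = select(memories, budget=2)
for m in chosen:
    print(m.text)
```

With these toy numbers the frequently revisited fact ranks first even though the one-off lunch remark is more similar to the query, and the sensitive memory is held back unless the caller explicitly allows it. A real system would need far more careful safety handling than a single flag.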

These are hard engineering problems. They are not solved by making the context window bigger. They are solved by building a dedicated memory layer that sits between the application and the LLM.

The Middleware Approach

This is the approach we took with KAPEX. Rather than relying on the LLM to manage its own memory (it cannot), or forcing developers to build bespoke memory systems for every application (they should not have to), KAPEX operates as middleware: a layer that intercepts LLM interactions, extracts and scores memories, and injects the highest-salience context into each prompt.

The LLM never sees a raw dump of every past interaction. It sees a curated, scored, and safety-filtered set of memories — the ones that are most likely to matter for this specific conversation. The result, validated in a blinded study of 1,655 participants, is an AI that feels fundamentally different to interact with. Users chose the memory-equipped system 80% of the time at conversation depth.
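The general shape of a memory middleware layer can be sketched as a wrapper around any function that maps a prompt to a reply. This is not KAPEX's API or implementation. The class name `MemoryMiddleware`, its methods, the sentence-level "extraction", the mention-count "scoring", and the card-number pattern are all hypothetical simplifications chosen to keep the example short.

```python
import re
from collections import Counter

# Rough pattern for 13 to 16 digit card-like numbers (illustrative only).
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


class MemoryMiddleware:
    """Hypothetical wrapper: store facts, then inject the top-scored ones."""

    def __init__(self, llm, budget=2):
        self.llm = llm             # any callable: prompt string -> reply string
        self.budget = budget       # how many memories may enter each prompt
        self.mentions = Counter()  # fact -> number of times it was seen

    def remember(self, user_message):
        """Naive extraction: one candidate fact per sentence."""
        for sentence in re.split(r"(?<=[.!?])\s+", user_message.strip()):
            # Safety filter: never persist anything that looks like a card number.
            if sentence and not CARD_PATTERN.search(sentence):
                self.mentions[sentence] += 1

    def ask(self, user_message):
        """Build a prompt from the highest-scored memories, then call the LLM."""
        top = [fact for fact, _ in self.mentions.most_common(self.budget)]
        prompt = "Known about the user:\n" + "\n".join(f"- {f}" for f in top)
        prompt += f"\n\nUser: {user_message}"
        self.remember(user_message)
        return self.llm(prompt)


def echo_llm(prompt):
    """Stand-in for a real model call: returns the prompt it received."""
    return prompt


mw = MemoryMiddleware(echo_llm, budget=1)
mw.remember("I am allergic to peanuts. My card is 4111 1111 1111 1111.")
mw.remember("I am allergic to peanuts.")
reply = mw.ask("Suggest a snack.")
print(reply)
```

Because the echo stand-in returns the prompt unchanged, printing `reply` shows exactly what a real model would have received: the repeated allergy fact and the new question, with the card number never stored.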

The Bottom Line

Context windows will keep getting larger. That is useful for many tasks — long document analysis, code generation, multi-step reasoning. But for the problem of making AI that knows you, window size is a distraction.

The question is not "how much can the model hold?" It is "how does the system decide what to put there?" Until that question has a principled answer — one involving scoring, decay, retrieval, and safety — every LLM conversation will start from zero.

That is the problem KAPEX was built to solve.

Patent pending
