
When you use an AI agent, the more context the agent has about the job, the better it performs. But agents have little working memory, because the large language models (LLMs) they depend on are stateless and can only work with what fits in a bounded context. When that memory runs out, the agent glitches, hangs, or produces nonsense. Tactics like truncating or compacting agent memory can compensate, but they're not real solutions.
A better answer to the AI agent memory crunch is memory that lives and persists outside of the agent itself. The agent's memory is still used for immediate work, but the longer-term, big-picture details get offloaded to another service and retrieved on demand.
The usual term for this is retrieval-augmented generation, or RAG. It has become as significant a technology as the agents and LLMs themselves, because it expands their capabilities in place, without retraining the model.
The basics of RAG
LLMs have what's called a "context window": a block of working memory, up to a certain size, that's used for processing input. The maximum size of the window varies by model. For instance, GPT-4 Turbo offers up to 128,000 tokens, while newer models like Gemini 1.5 Pro can handle 1 million tokens or more. The larger the context window, the more information the model can process at once (e.g., a file containing code for analysis), and the more complex the conversation it can sustain.
The premise behind RAG is simple: Use the LLM's context window for information that matters in the immediate conversation, and use persistent storage systems (RAG) for information outside of that. The model's context window serves as short-term memory, and RAG serves as long-term memory. This distinction is critical because it allows agents to handle long-running tasks, maintain state across sessions, and recall past interactions without exceeding token limits.
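The split described above can be sketched in a few lines. This is a minimal illustration, not a production design: the class name `TwoTierMemory` is hypothetical, word counts stand in for real token counts, and word overlap stands in for real retrieval.

```python
from collections import deque


class TwoTierMemory:
    """Short-term context with a size cap, plus an unbounded long-term store."""

    def __init__(self, max_context_words: int = 50):
        self.max_context_words = max_context_words
        self.context = deque()   # stands in for the model's context window
        self.long_term = []      # stands in for external RAG storage

    def _context_size(self) -> int:
        # Assumption: whitespace-separated words are a crude proxy for tokens.
        return sum(len(m.split()) for m in self.context)

    def add(self, message: str) -> None:
        self.context.append(message)
        # Offload the oldest messages until the context fits its budget again.
        while self._context_size() > self.max_context_words and len(self.context) > 1:
            self.long_term.append(self.context.popleft())

    def recall(self, query: str, k: int = 1) -> list[str]:
        # Rank offloaded messages by how many words they share with the query.
        q = set(query.lower().split())
        scored = [(len(q & set(m.lower().split())), m) for m in self.long_term]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]
```

Nothing is discarded when the context overflows. Older messages move to the long-term store and can be pulled back with `recall` when they become relevant again.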
RAG storage also comes in a few different forms. A 2024 paper titled "Cognitive Architectures for Language Agents" goes into great detail about them, but it's worth breaking them down in plainer language. These forms mirror types of human memory, which makes it easier to reason about what an agent should store and where.
The different kinds of RAG memory
Let's examine the three basic ways RAG storage works: episodic memory, semantic memory, and procedural memory. Each serves a distinct purpose and requires different implementation strategies.
Episodic memory: flows and processes
Episodic memory stores records of what the LLM did at some earlier point in time: a decision the LLM made, and the result of that decision. These experiences can be ordered by time to produce what the paper describes as "history event flows", or the processes that generated some particular output. Through episodic memory, the LLM can reconstruct a decision or process it previously performed, and use that experience to guide future action.
For example, if an agent helped a user debug a Python script by identifying a missing import, that interaction can be stored as an episodic record. Next time the user encounters a similar issue, the agent can recall the exact steps taken and apply them again. Episodic memory is especially useful for multi-turn conversations, customer support bots, and any scenario where past outcomes inform future choices.
Implementation-wise, episodic memories are often stored as timestamped logs in a vector database. Each entry may include the query, the agent's action, the result, and metadata like the date and user ID. Retrieval is typically based on semantic similarity to the current context.
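As a rough sketch of such a record and its retrieval, the example below keeps episodes in a plain list rather than a vector database. The names `Episode` and `EpisodicMemory` are hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Episode:
    query: str
    action: str
    result: str
    user_id: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EpisodicMemory:
    def __init__(self):
        self.episodes: list[Episode] = []

    def record(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def recall(self, context: str, k: int = 1) -> list[Episode]:
        # Return the k episodes whose original query is most similar to the context.
        ctx = embed(context)
        ranked = sorted(self.episodes, key=lambda e: cosine(ctx, embed(e.query)), reverse=True)
        return ranked[:k]
```

With a real embedding model, `recall` would also match paraphrases that share no words with the stored query. The metadata fields (`user_id`, `timestamp`) are there so results can be filtered or ordered by time.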
Semantic memory: facts and things
Semantic memory stores structured data "about the world and [the agent] itself", as the paper puts it. This could be as simple as a basic key/value store for user preferences, or it could involve a more complex system built on vector embeddings. The point is to give the agent a way to look up such "world knowledge" readily, in a format the agent can use as-is.
For instance, a travel booking agent might maintain a semantic memory of preferred airlines, seat preferences, and past destinations. Similarly, a code review agent could store the coding standards of a particular organization. The data is factual and relatively static, meaning it changes infrequently but is accessed often.
It also helps for semantic memory to be controllable. As the paper notes, an external source like Wikipedia is "an external environment that may be unexpectedly modified by other users," but an offline version (essentially, a static point-in-time snapshot) would not have this problem. In enterprise settings, semantic memory is often populated from curated knowledge bases, internal documentation, and configuration files. This ensures consistency and reliability.
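At the simple end of that range, a semantic memory can be an ordinary key/value store. The sketch below is one possible shape, with a hypothetical `SemanticMemory` class and a `snapshot` method that returns a read-only, point-in-time copy in the spirit of the offline snapshot mentioned above.

```python
from types import MappingProxyType


class SemanticMemory:
    """Facts keyed by (subject, attribute), e.g. ("user:42", "seat")."""

    def __init__(self):
        self._facts = {}

    def learn(self, subject: str, attribute: str, value) -> None:
        self._facts[(subject, attribute)] = value

    def lookup(self, subject: str, attribute: str, default=None):
        return self._facts.get((subject, attribute), default)

    def snapshot(self):
        # A read-only copy of the facts as they are now; later writes do not affect it.
        return MappingProxyType(dict(self._facts))
```

An agent working from a snapshot sees the same facts for the whole task, even if the live store is updated in the meantime.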
Procedural memory: tasks and skills
On the surface, procedural memory sounds a little like episodic memory: it's used to store things like reasoning processes or learning procedures. But procedural memory is specifically for allowing the LLM to reproduce the steps of a process, rather than the mere fact that it followed such a process. It allows those procedures to be performed repeatedly without having to be re-discovered or re-created from scratch each time.
Consider an agent that performs data cleaning: it might have a procedure for handling missing values, removing duplicates, and normalizing formats. With procedural memory, the agent can recall the exact sequence of operations and apply them to new datasets. Similarly, a deployment agent might store the steps for deploying a containerized application to a cloud environment, including rollback procedures.
One important property is common to all of these kinds of memory: they favor reads over writes. Semantic memory, for instance, isn't written to very often, though it can be useful for the agent to record new facts it learns about its world. Procedural memory is riskier still: letting the agent write to it freely might "introduce bugs or allow an agent to subvert its designers' intentions", as the paper notes. Therefore, updates to procedural memory are typically restricted to trusted sources or require human approval.
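One way to picture read-mostly procedural memory is a registry of named procedures where running is unrestricted but registering goes through an approval hook. Everything here is illustrative: `ProceduralMemory`, the `approver` callback, and the three data-cleaning steps are hypothetical names, and a real approval step would likely involve a human or a signed source rather than a name check.

```python
from typing import Callable


class ProceduralMemory:
    """Named procedures the agent may run freely but may not rewrite freely."""

    def __init__(self, approver: Callable[[str], bool]):
        self._procedures: dict[str, list[Callable]] = {}
        self._approver = approver   # hypothetical hook, e.g. a human review step

    def register(self, name: str, steps: list[Callable]) -> bool:
        # Writes go through the approver; rejected procedures are not stored.
        if not self._approver(name):
            return False
        self._procedures[name] = list(steps)
        return True

    def run(self, name: str, data):
        # Reads are unrestricted: apply each stored step in order.
        for step in self._procedures[name]:
            data = step(data)
        return data


def drop_missing(rows):
    return [r for r in rows if r is not None]


def normalize(rows):
    return [r.strip().lower() for r in rows]


def dedupe(rows):
    return list(dict.fromkeys(rows))   # preserves first-seen order
```

Registering `[drop_missing, normalize, dedupe]` once lets the agent replay the same cleaning sequence on any new dataset with a single `run` call, instead of rediscovering the steps each time.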
Implementing RAG
While RAG is a well-established pattern on the agent's side, there's no one canonical way to implement RAG storage. The storage layer is typically a vector database, although many general-purpose databases now support vector search as well. Popular vector databases include Pinecone, Weaviate, Milvus, and Qdrant, while traditional databases like PostgreSQL (via pgvector) and Elasticsearch also offer vector capabilities.
Where that memory lives is also open-ended. A service that provides access to an LLM, for instance, could include RAG on the server side as part of its package of offerings. This is common in platforms like OpenAI's Assistants API or Google's Vertex AI, which provide built-in knowledge retrieval. Alternatively, a locally run LLM could have RAG storage services running side by side on the same system that hosts the model. The downside of this last approach is that the system will require that much more local storage and processing power.
RAG storage also requires its own separate upkeep. Each agent and use case will impose different demands on how to manage that storage. Older data, for instance, might need to be aged out periodically, or given less weight than newer or more frequently accessed data. This aging mechanism ensures that the agent's long-term memory remains relevant and doesn't become cluttered with obsolete information. Techniques like time-based decay, recency-weighted scoring, and periodic purging are commonly used.
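Time-based decay and periodic purging can be expressed compactly. The sketch below assumes an exponential decay with a configurable half-life, which is one common choice among several. The function names and the `age_days` field are hypothetical.

```python
import math


def decayed_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    # Halve the similarity score for every half_life_days of age.
    return similarity * math.pow(0.5, age_days / half_life_days)


def purge(records: list[dict], max_age_days: float) -> list[dict]:
    # Keep only records that are no older than the cutoff.
    return [r for r in records if r["age_days"] <= max_age_days]
```

With a 30-day half-life, a record that matched with similarity 0.8 a month ago scores about 0.4 today, so a fresher but weaker match can outrank it. The half-life is a tuning knob that depends on how quickly information goes stale in a given use case.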
Finally, while multiple agents can share the same RAG storage, they shouldn't do so indiscriminately. At the very least, each agent should operate in its own context so that data and use cases from one agent don't interfere with others. A more complex and ambitious approach is to use a tool like Microsoft AutoGen to build shared multi-agent RAG contexts. In such architectures, agents can collaborate by writing to and reading from a common memory store, but with namespacing or access controls to avoid collisions.
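The namespacing idea can be illustrated without any particular framework. The sketch below does not use AutoGen or reflect its API. It is a hypothetical `SharedMemory` class in which each agent writes only to its own namespace and reads from another agent's namespace only after an explicit grant.

```python
class SharedMemory:
    """One backing store, partitioned by namespace, with explicit read grants."""

    def __init__(self):
        self._data: dict[str, dict[str, str]] = {}
        self._readers: dict[str, set[str]] = {}

    def write(self, agent: str, key: str, value: str) -> None:
        # Each agent writes only into its own namespace, so keys cannot collide.
        self._data.setdefault(agent, {})[key] = value

    def grant_read(self, owner: str, reader: str) -> None:
        self._readers.setdefault(owner, set()).add(reader)

    def read(self, agent: str, namespace: str, key: str) -> str:
        allowed = agent == namespace or agent in self._readers.get(namespace, set())
        if not allowed:
            raise PermissionError(f"{agent} may not read {namespace}")
        return self._data[namespace][key]
```

Two agents can both store a key called `goal` without overwriting each other, and collaboration happens only where `grant_read` has opened a path.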
Beyond these basics, developers must also consider chunking strategies for documents, embedding models for semantic search, and evaluation metrics for retrieval quality. The choice of embedding model (e.g., OpenAI's text-embedding-3-small, Sentence Transformers) directly affects how well the agent can find relevant information. Similarly, the chunk size and overlap impact recall and accuracy. In practice, iterative experimentation is necessary to tune these components for a given domain.
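Chunk size and overlap are easiest to see in a small example. The sketch below splits on whitespace and counts words. That is a simplifying assumption, since real pipelines usually count tokens and often split on sentence or section boundaries. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break   # this window already reached the end of the text
    return chunks
```

For example, `chunk_words("a b c d e f g", size=4, overlap=2)` returns `["a b c d", "c d e f", "e f g"]`. Larger overlap reduces the chance that a relevant passage is cut in half at a chunk boundary, at the cost of storing and embedding more text.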
Another consideration is hybrid search: combining keyword-based retrieval with semantic search to improve coverage. For example, an agent might use BM25 for exact matches and vector search for conceptual matches, merging results with a reciprocal rank fusion algorithm. This hybrid approach is particularly effective in domains with specialized jargon or rare terms.
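Reciprocal rank fusion itself is only a few lines. In the sketch below, each input is a list of document IDs ordered best-first, as might come from a BM25 search and a vector search. The searches themselves are not shown, and the document IDs in the usage note are made up. The constant `k = 60` is the value commonly used for this algorithm and can be tuned.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one, scoring each document by the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Given a keyword ranking `["d1", "d2", "d3"]` and a vector ranking `["d3", "d1", "d4"]`, the fused order is `["d1", "d3", "d2", "d4"]`: documents that both retrievers rank highly rise to the top. Because only ranks are used, the two retrievers' raw scores never need to be put on the same scale.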
Security is also paramount because RAG storage may contain sensitive user data. Encryption at rest and in transit, access control lists, and audit logging are essential. Moreover, agents should be designed to forget information when requested (right to erasure) and to avoid memorizing personally identifiable information unless explicitly allowed.
Looking ahead, research continues on memory consolidation techniques, where the agent periodically compresses episodic memories into semantic ones, similar to how humans transform experiences into general knowledge. This reduces storage requirements and improves retrieval speed. Additionally, some frameworks are exploring hierarchical memory, where short-term, medium-term, and long-term stores coexist with different eviction policies.
In summary, RAG offers a powerful paradigm for overcoming the inherent memory limitations of LLM-based agents. By offloading facts, experiences, and procedures to external stores, agents can maintain coherent, long-running interactions and learn from the past without being constrained by context windows. The key is to choose the right type of memory for each task, implement robust retrieval and storage mechanisms, and continuously monitor and refine the system.
Source: InfoWorld News
