---
url: https://lettuceai.app/docs/memory
title: "Memory System — LettuceAI"
description: "Keep long conversations coherent with manual or dynamic memory using embedding-based retrieval and automatic maintenance."
---


# Memory System

LettuceAI keeps long conversations coherent by separating recent chat context from long-term memory. It does not replay your entire history on every turn.

## Memory in plain English

Memory is how LettuceAI helps the AI remember important old details without resending your entire chat every time.

-   **Manual Memory**: you write fixed notes and the AI sees them every turn
-   **Dynamic Memory**: the app tries to pull back the most relevant old details automatically

**What most users need to know**

If you want simple control, use Manual Memory. If you want longer chats with less babysitting, use Dynamic Memory. The deeper retrieval and maintenance details below are mostly for advanced users.

There are two operating modes. Manual Memory is explicit and predictable. Dynamic Memory is selective: it stores many facts internally, retrieves only the most useful ones per turn, and periodically maintains that memory set in the background.

**Direct and group chats**

Direct chats and group chats can use different Dynamic Memory settings. In a direct chat, Dynamic Memory also depends on whether the character has it enabled; a group chat uses the memory mode selected for that group session.

## Modes at a glance

| Mode | How it works | Best for |
| --- | --- | --- |
| Manual | You maintain a fixed list of memories that is injected every turn. | Short chats, strict control, small stable fact sets. |
| Dynamic | LettuceAI stores embeddings, retrieves relevant items, and runs a maintenance cycle every few new messages. | Long-running chats, roleplay continuity, lower prompt growth. |

## Manual Memory

Manual Memory behaves like a fixed note sheet. The model sees the full manual memory list on every turn, alongside the recent conversation window.

-   You write the memories yourself.
-   The list stays stable until you edit or delete it.
-   Prompt cost grows with the number and size of manual memories.

## Dynamic Memory

Dynamic Memory is not one giant prompt block. It is a small local memory subsystem with three main pieces of state:

**Quick translation**

If terms like **embeddings**, **semantic retrieval**, or **cosine mode** sound too technical, the simple version is: the app compares meaning, tries to find relevant old memories, and injects only the useful ones.

-   **Memory entries**: short factual items with embeddings, heat state, access counts, optional pinning, and a category tag.
-   **Context summary**: a compressed summary of processed conversation windows.
-   **Memory events**: a log of each maintenance cycle so the app knows what range of messages has already been processed.

![Diagram showing the three main pieces of Dynamic Memory state](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiExmjHl3mUkw2MN5G1QtVTFbq6Audpci7RZUDX)

Dynamic Memory keeps three kinds of state: the memory entries themselves, a rolling context summary, and a cycle log that tracks processed ranges.
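
For readers who want a concrete picture, here is a rough TypeScript sketch of what those three pieces of state might hold. The field names are illustrative assumptions, not LettuceAI's actual schema:

```ts
// Illustrative shapes only; the real schema is internal to LettuceAI.
interface MemoryEntry {
  id: string;
  text: string;           // short factual item
  embedding: number[];    // local embedding vector (768 dimensions on v4)
  category: string;       // category tag
  importance: number;     // heat state; decays across maintenance cycles
  accessCount: number;    // how often retrieval has surfaced this entry
  pinned: boolean;        // pinned entries stay hot and never decay
  observedAt?: string;    // event timestamp, set in time-aware companion chats
}

interface ContextSummary {
  text: string;           // compressed summary of processed conversation windows
  coveredUpTo: number;    // index of the last message folded into the summary
}

interface MemoryEvent {
  cycleId: string;
  processedRange: [number, number]; // message range this maintenance cycle covered
  completedAt: string;
}
```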

### What happens on each chat turn

When you send a message, LettuceAI first builds the normal prompt from recent conversation, system prompts, persona data, and lorebook content. If Dynamic Memory is active, it also runs memory retrieval.

-   By default, retrieval uses a **smart** strategy: semantic search over hot memories plus a small recency/frequency bias.
-   You can switch to **cosine** mode for pure similarity ranking.
-   Only the retrieved memories are injected into the prompt, not the full memory database.
-   Retrieved memories are marked as accessed, which refreshes their heat and can promote cold items back to hot.

![Diagram showing the per-turn Dynamic Memory flow](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiE29TgWm0CPqXAWsoHbY7GZ0FzOUpuygEvIwme)

Per-turn retrieval is narrow: build the prompt, pull only relevant memories if Dynamic Memory is active, send the final prompt, then save the reply.
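
As a sketch (the helper names are hypothetical, not the app's real API), the per-turn pipeline composes like this, with `retrieveMemories` expanded in the retrieval section below:

```ts
// Hypothetical helpers standing in for internal plumbing.
interface Session { dynamicMemoryEnabled: boolean; messages: { role: string; text: string }[] }
interface Prompt { memories?: MemoryEntry[] }
declare function buildBasePrompt(session: Session, userMessage: string): Prompt;
declare function retrieveMemories(query: string, session: Session): Promise<MemoryEntry[]>;
declare function markAccessed(m: MemoryEntry): void; // sketched under "Hot vs cold memory"
declare function sendToProvider(prompt: Prompt): Promise<string>;

async function runTurn(session: Session, userMessage: string): Promise<string> {
  // Normal prompt assembly: recent window, system prompts, persona, lorebooks.
  const prompt = buildBasePrompt(session, userMessage);

  // Memory retrieval only runs when Dynamic Memory is active.
  if (session.dynamicMemoryEnabled) {
    const memories = await retrieveMemories(userMessage, session);
    prompt.memories = memories;               // inject only the retrieved subset
    memories.forEach((m) => markAccessed(m)); // refresh heat; may re-heat cold items
  }

  const reply = await sendToProvider(prompt); // send the final prompt
  session.messages.push({ role: "assistant", text: reply }); // save the reply
  return reply;
}
```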

### Hot vs cold memory

Each dynamic entry has an importance score. Hot memories are eligible for semantic retrieval. Cold memories remain stored but are normally excluded from semantic search.

-   **Hot**: active working memory for semantic retrieval.
-   **Cold**: archived locally and available mainly through keyword fallback.
-   **Pinned**: protected from decay and forced to stay hot.

Cooling does not delete a memory by itself. It only changes whether the entry participates in normal semantic retrieval.

![Diagram showing how a memory moves between hot and cold states](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiEBCTJVYXeaxpi2W1v7t9cGgFBwm4kZdrEyDTJ)

Heat changes over time. Retrieved memories stay fresh, neglected ones cool down, and cold memories can become hot again when they are found later.
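
A minimal sketch of that heat bookkeeping, assuming a single numeric threshold (the real decay curve and threshold are internal or configurable):

```ts
const COLD_THRESHOLD = 0.2; // assumed value, for illustration only

// Retrieval refreshes an entry's heat and can promote a cold one back to hot.
function markAccessed(e: MemoryEntry): void {
  e.accessCount += 1;
  e.importance = Math.max(e.importance, COLD_THRESHOLD + 0.5);
}

// Maintenance cools unpinned entries over time. Cooling never deletes anything;
// it only removes the entry from normal semantic retrieval.
function cool(e: MemoryEntry, decayRate: number): void {
  if (e.pinned) return; // pinned entries are protected from decay
  e.importance -= decayRate;
  // Entries whose importance falls below COLD_THRESHOLD are treated as cold.
}
```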

### How retrieval actually works

The retrieval path is more selective than older descriptions of the system suggested.

1.  The app embeds the search query locally. With context enrichment enabled, that query can include more than just the latest message.
2.  It searches **hot and pinned** memories semantically.
3.  In smart mode, it may also surface a recent hot memory or a frequently used hot memory to stabilize recall.
4.  If nothing useful is found, it falls back to a keyword scan over **cold** memories.
5.  Any cold memory that is retrieved becomes hot again.

![Diagram showing the retrieval decision path](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiEhN9IRQYy8QKsz1bUcRixZ7tHBO4qoCruLYPV)

The retrieval pass starts with semantic search over hot memories and only falls back to cold-storage keyword matching when the semantic pass comes up empty.

**Important**

Dynamic Memory does not search every stored memory semantically on every turn. The hot/cold split is a core part of how it stays fast and cheap.
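
Putting those steps together, a simplified retrieval pass might look like the sketch below. The similarity cutoff, result limit, and promotion value are assumptions, and the smart-mode bias is only noted in a comment:

```ts
declare function embedLocally(text: string): Promise<number[]>; // local embedding model
declare function cosine(a: number[], b: number[]): number;      // cosine similarity

async function retrieveMemories(query: string, all: MemoryEntry[]): Promise<MemoryEntry[]> {
  const q = await embedLocally(query); // may be an enriched query, not just the last line

  // 1. Semantic search runs over hot and pinned entries only.
  const hot = all.filter((m) => m.pinned || m.importance >= COLD_THRESHOLD);
  let hits = hot
    .map((m) => ({ m, score: cosine(q, m.embedding) }))
    .filter((h) => h.score > 0.5) // assumed relevance cutoff
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map((h) => h.m);

  // 2. Smart mode would also mix in a recent or frequently used hot entry here.

  // 3. Keyword fallback over cold entries when the semantic pass finds nothing.
  if (hits.length === 0) {
    const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 2);
    hits = all.filter(
      (m) => m.importance < COLD_THRESHOLD &&
             words.some((w) => m.text.toLowerCase().includes(w)),
    );
    // 4. Any cold entry retrieved this way is promoted back to hot.
    hits.forEach((m) => (m.importance = COLD_THRESHOLD + 0.5));
  }
  return hits;
}
```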

### Time-aware retrieval (companion mode)

When a companion chat has time awareness enabled, retrieval gets an extra step before scoring. The runtime parses temporal phrases out of the search query (things like "yesterday", "last week", "this month", "two fridays ago", "five weeks ago today", "in the last three days") and resolves them to a local date range.

-   If a range is detected, retrieval first narrows candidates to memories whose `observed_at` timestamp falls inside that range, then ranks them. If nothing falls in the range, retrieval returns empty rather than padding with off-topic entries.
-   When time awareness is on, new memories created during a turn are stamped with an `observed_at` taken from the latest message, which is what makes temporal filtering possible later. Without it, memories carry no event timestamp.
-   The ranker also adds a small lexical-overlap boost on top of cosine similarity, so concrete anchors in the query (names, places) reliably surface memories that mention the same things.
-   Memories that have an `observed_at` are rendered into the prompt with a short "observed YYYY-MM-DD HH:MM TZ" suffix.

Time-aware retrieval is companion-only today. Non-companion chats and roleplay flows skip the temporal phrase parser and run plain semantic retrieval. See **Companion Mode** for the per-session toggle and the time placeholder list.
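
In sketch form, the temporal step is a pre-filter that runs before ranking, reusing the `observedAt` field from the state sketch above (the phrase parser itself is assumed, not shown):

```ts
interface DateRange { start: Date; end: Date }

// Resolves phrases like "yesterday" or "two fridays ago" to a local date range.
declare function parseTemporalPhrase(query: string): DateRange | null;

function temporalCandidates(query: string, all: MemoryEntry[]): MemoryEntry[] | null {
  const range = parseTemporalPhrase(query);
  if (!range) return null; // no temporal phrase: fall through to plain semantic retrieval

  // Narrow to memories observed inside the resolved range, then rank only those.
  // If nothing falls in the range, this returns an empty list rather than
  // padding results with off-topic entries.
  return all.filter((m) => {
    if (!m.observedAt) return false; // no event timestamp, cannot match a range
    const t = new Date(m.observedAt);
    return t >= range.start && t <= range.end;
  });
}
```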

### Background maintenance cycle

Retrieval happens every turn. Memory maintenance does not. Instead, the app waits until enough new user or assistant messages have accumulated since the last processed window.

1.  After the configured message interval, LettuceAI selects the next unsummarized conversation window.
2.  A summarization model generates a merged summary for that new window.
3.  A tool-driven pass can **create**, **delete**, **pin**, or **unpin** memory entries.
4.  The app stores the updated summary, the new memory set, and the event cursor so the next cycle continues from the right point.

Manual retries can force another maintenance pass even if the normal interval has not been reached yet.
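
A sketch of the gating and bookkeeping, with hypothetical helpers standing in for the actual summarization and tool-pass model calls:

```ts
type Msg = { role: string; text: string };
type MemoryAction =
  | { kind: "create"; text: string; category: string }
  | { kind: "delete"; id: string }
  | { kind: "pin"; id: string }
  | { kind: "unpin"; id: string };

declare function summarizeWindow(window: Msg[], prior: string): Promise<string>;
declare function runMemoryToolPass(summary: string, window: Msg[]): Promise<MemoryAction[]>;
declare function applyActions(actions: MemoryAction[]): void;

async function maybeRunMaintenance(
  state: { messages: Msg[]; cursor: number; contextSummary: string },
  interval: number, // configured message interval
): Promise<void> {
  const fresh = state.messages.slice(state.cursor);
  if (fresh.length < interval) return; // not enough new messages yet

  // Merge the next unsummarized window into the rolling summary.
  const summary = await summarizeWindow(fresh, state.contextSummary);
  // Tool-driven pass: create / delete / pin / unpin memory entries.
  const actions = await runMemoryToolPass(summary, fresh);
  applyActions(actions);

  // Store the summary and advance the cursor so the next cycle resumes correctly.
  state.contextSummary = summary;
  state.cursor = state.messages.length;
}
```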

**Structured fallback format**

Some models do not reliably emit tool calls during the memory pass. When that happens, LettuceAI falls back to a structured text format (XML by default, JSON optional) so it can still extract memory create, delete, pin, and unpin actions from the model output.
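
The exact tags are not documented here, but a hypothetical XML shape and a tolerant extraction pass could look like this:

```ts
// Hypothetical fallback output; the real tag names may differ.
const sampleOutput = `
<memory_actions>
  <create category="fact">The user's cat is named Miso.</create>
  <pin id="mem_42"/>
</memory_actions>`;

// Pull create actions out of raw model text when tool calls were not emitted.
function extractCreates(output: string): string[] {
  const creates: string[] = [];
  const re = /<create[^>]*>([\s\S]*?)<\/create>/g;
  let match: RegExpExecArray | null;
  while ((match = re.exec(output)) !== null) creates.push(match[1].trim());
  return creates;
}

extractCreates(sampleOutput); // => ["The user's cat is named Miso."]
```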

![Diagram showing the Dynamic Memory maintenance cycle](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiELhtqdUum4E1s7fKcbIrtgJlMzOkUhDv6GqQX)

The background cycle is separate from the live chat turn: it summarizes a new window, runs memory tools, and advances the stored cursor.

### Decay, budgets, and deletion

Dynamic Memory uses multiple controls, not just one:

-   **Decay rate**: lowers the importance of hot, unpinned memories across maintenance cycles.
-   **Cold threshold**: entries below this threshold move to cold storage.
-   **Hot memory token budget**: if too many hot memories are active, older unpinned ones are demoted to cold.
-   **Max entries**: if the memory set grows too large, the least recently used unpinned memories can be trimmed entirely.

That last point matters: Dynamic Memory can delete or soft-delete entries during maintenance. Cooling alone is not deletion, but the full system is allowed to remove memories when it decides they are stale, wrong, or lower priority than newer ones.
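
A simplified sketch of how those controls could compose during a maintenance pass. The ordering, the token estimator, and the use of access counts as a stand-in for least-recently-used tracking are all assumptions:

```ts
declare function estimateTokens(text: string): number;

function runDecayAndBudgets(
  entries: MemoryEntry[],
  opts: { decayRate: number; coldThreshold: number; hotTokenBudget: number; maxEntries: number },
): MemoryEntry[] {
  // Decay rate: lower the importance of hot, unpinned entries each cycle.
  for (const e of entries) if (!e.pinned) e.importance -= opts.decayRate;

  // Hot memory token budget: demote unpinned hot entries while over budget.
  const hot = entries.filter((e) => e.pinned || e.importance >= opts.coldThreshold);
  let tokens = hot.reduce((sum, e) => sum + estimateTokens(e.text), 0);
  for (const e of hot) {
    if (tokens <= opts.hotTokenBudget) break;
    if (e.pinned) continue;                // pinned entries are never demoted
    e.importance = opts.coldThreshold - 1; // move to cold storage
    tokens -= estimateTokens(e.text);
  }

  // Max entries: trim least recently used unpinned entries entirely.
  if (entries.length > opts.maxEntries) {
    const removable = entries
      .filter((e) => !e.pinned)
      .sort((a, b) => a.accessCount - b.accessCount) // crude LRU stand-in
      .slice(0, entries.length - opts.maxEntries);
    const drop = new Set(removable);
    entries = entries.filter((e) => !drop.has(e));
  }
  return entries;
}
```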

## Embedding models

Dynamic Memory depends on a local embedding model so the app can compare meaning, not just exact wording.

**v4** is the current default. Older **v1**, **v2**, and **v3** installs may still exist on some devices, but they are now legacy variants.

| Version | Status | Context support | Notes |
| --- | --- | --- | --- |
| v1  | Legacy | 512 tokens | Oldest model. Limited long-context support. |
| v2  | Legacy | Up to 4096 tokens | Major improvement over v1. |
| v3  | Legacy | Up to 4096 tokens | Previous default. Still works, but v4 is preferred for new installs. |
| v4  | Current default | Up to 4096 tokens | 768-dimension model. Recommended for new installs and upgrades. |

On v3 and v4, you can set the local embedding capacity anywhere between 512 and 4096 tokens, depending on how you want to balance memory quality against performance.

### Downloading the embedding model

Embedding models run on your device, so they have to be downloaded before Dynamic Memory can work. The embedding download page walks you through this in two steps:

1.  **Pick a capacity**: choose how many tokens of context the local model should support per memory.
    -   **1K tokens**: smallest and fastest. Best for quick responses and lower-end devices.
    -   **2K tokens**: balanced default. Good quality without much extra cost.
    -   **4K tokens**: maximum context. Best memory quality, uses more memory and CPU per embed.
2.  **Download**: the v4 model files (about 138 MB) are fetched in the background. Progress is shown live, and you can cancel or retry if the download fails.

After the download finishes, the app runs a short self-test to confirm embeddings work end to end before enabling Dynamic Memory. You can upgrade from an older model version (v1, v2, v3) using the same flow.

**Switching capacity later**

Capacity is chosen per install. If you want to change it later, you can re-run the download flow and pick a different size.

### Context enrichment

Context enrichment improves retrieval by embedding a short contextual slice instead of only the latest message. That helps when the newest user line contains pronouns, callbacks, or implied references.

-   Better follow-up recall
-   Better disambiguation of names and pronouns
-   Better stability when the conversation jumps topics quickly
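
In sketch form, enrichment only changes what gets embedded; the slice size here is an assumption:

```ts
// Embed a short contextual slice instead of only the newest user line,
// so pronouns and callbacks resolve against recent context.
function buildRetrievalQuery(messages: { role: string; text: string }[]): string {
  const slice = messages.slice(-4); // assumed window; the app's actual size is internal
  return slice.map((m) => `${m.role}: ${m.text}`).join("\n");
}
```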

**Local-first**

Embeddings are generated and stored locally. Providers only see memory text if that memory is actually injected into the prompt for a turn.

## Token usage and cost

Dynamic Memory usually reduces prompt growth, but it is not free.

-   Retrieved memories still cost tokens when they are injected into a prompt.
-   The summarization and memory-tool passes also consume tokens on the summarization model you selected.
-   Manual Memory has the simplest behavior, but its prompt cost rises linearly because the whole memory list is sent every turn.

The reason Dynamic Memory is usually cheaper is not that it costs nothing. It is cheaper because it sends a small relevant subset most of the time instead of replaying everything.
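
For a rough sense of scale (illustrative numbers, not measurements): a manual list of 50 memories at about 30 tokens each adds roughly 1,500 tokens to every single turn, while dynamic retrieval of the top five entries at the same size injects roughly 150 tokens per turn, with the summarization and tool passes adding a periodic rather than per-turn cost on top.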

## Managing memories

You still keep control, even in Dynamic Mode.

-   In Manual Mode, you add, edit, and delete the fixed list yourself.
-   In Dynamic Mode, you can review entries, pin important ones, delete incorrect ones, and manually add missing facts.

## Which mode should you use?

-   **Use Manual Memory** if you want a short, stable, always visible rule list.
-   **Use Dynamic Memory** if you want better continuity across long conversations without letting prompt size grow forever.

[Previous: Companion Mode](/docs/companion-mode) · [Next: Lorebooks](/docs/lorebooks)
