---
url: https://lettuceai.app/docs/memory
title: "Memory System — LettuceAI"
description: "Keep long conversations coherent with manual or dynamic memory using embedding-based retrieval and automatic maintenance."
---


# Memory System

LettuceAI keeps long conversations coherent by separating recent chat context from long-term memory. It does not replay your entire history on every turn.

## Memory in plain English

Memory is how LettuceAI helps the AI remember important old details without resending your entire chat every time.

-   **Manual Memory**: you write fixed notes and the AI sees them every turn
-   **Dynamic Memory**: the app tries to pull back the most relevant old details automatically

What most users need to know

If you want simple control, use Manual Memory. If you want longer chats with less babysitting, use Dynamic Memory. The deeper retrieval and maintenance details below are mostly for advanced users.

There are two operating modes. Manual Memory is explicit and predictable. Dynamic Memory is selective: it stores many facts internally, retrieves only the most useful ones per turn, and periodically maintains that memory set in the background.

Direct and group chats

Direct chats and group chats can use different Dynamic Memory settings. Direct chats also depend on the character using Dynamic Memory, while group chats use the memory mode selected for that group session.

## Modes at a glance

| Mode | How it works | Best for |
| --- | --- | --- |
| Manual | You maintain a fixed list of memories that is injected every turn. | Short chats, strict control, small stable fact sets. |
| Dynamic | LettuceAI stores embeddings, retrieves relevant items, and runs a maintenance cycle every few new messages. | Long-running chats, roleplay continuity, lower prompt growth. |

## Manual Memory

Manual Memory behaves like a fixed note sheet. The model sees the full manual memory list on every turn, alongside the recent conversation window.

-   You write the memories yourself.
-   The list stays stable until you edit or delete it.
-   Prompt cost grows with the number and size of manual memories.

## Dynamic Memory

Quick translation

If terms like **embeddings**, **semantic retrieval**, or **cosine mode** sound too technical, the simple version is: the app compares meaning, tries to find relevant old memories, and injects only the useful ones.

Dynamic Memory is not one giant prompt block. It is a small local memory subsystem with three main pieces of state:

-   **Memory entries**: short factual items with embeddings, heat state, access counts, optional pinning, and a category tag.
-   **Context summary**: a compressed summary of processed conversation windows.
-   **Memory events**: a log of each maintenance cycle so the app knows what range of messages has already been processed.

![Diagram showing the three main pieces of Dynamic Memory state](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiExmjHl3mUkw2MN5G1QtVTFbq6Audpci7RZUDX)

Dynamic Memory keeps three kinds of state: the memory entries themselves, a rolling context summary, and a cycle log that tracks processed ranges.
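
As a rough illustration, the three pieces of state could be modeled along these lines. Every class and field name here is an assumption made for the sketch, not LettuceAI's actual storage schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    text: str                       # the short factual item
    embedding: list[float]          # vector produced by the local embedding model
    importance: float = 1.0         # heat score (assumed scale: 0.0 to 1.0)
    is_cold: bool = False
    access_count: int = 0
    pinned: bool = False
    category: Optional[str] = None


@dataclass
class MemoryEvent:
    # One record per maintenance cycle: the message range it processed.
    first_message_index: int
    last_message_index: int


@dataclass
class DynamicMemoryState:
    entries: list[MemoryEntry] = field(default_factory=list)
    context_summary: str = ""
    events: list[MemoryEvent] = field(default_factory=list)

    def next_unprocessed_index(self) -> int:
        """Index of the first message that no cycle has covered yet."""
        if not self.events:
            return 0
        return self.events[-1].last_message_index + 1
```

The event log is what lets a later cycle resume from the right message instead of reprocessing the whole chat.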

### What happens on each chat turn

When you send a message, LettuceAI first builds the normal prompt from recent conversation, system prompts, persona data, and lorebook content. If Dynamic Memory is active, it also runs memory retrieval.

-   By default, retrieval uses a **smart** strategy: semantic search that favors hot memories, plus a small recency/frequency bias.
-   You can switch to **cosine** mode for pure similarity ranking.
-   Only the retrieved memories are injected into the prompt, not the full memory database.
-   Retrieved memories are marked as accessed, which refreshes their heat and can promote cold items back to hot.

![Diagram showing the per-turn Dynamic Memory flow](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiE29TgWm0CPqXAWsoHbY7GZ0FzOUpuygEvIwme)

Per-turn retrieval is narrow: build the prompt, pull only relevant memories if Dynamic Memory is active, send the final prompt, then save the reply.

### Hot vs cold memory

Each dynamic entry has an importance score. Hot memories are the active working set and surface easily during retrieval. Cold memories are not removed from search, but they count for less, so they only resurface when they are clearly relevant.

-   **Hot**: active working memory, searched at full strength.
-   **Cold**: kept locally and still searchable, but weighted lower so it stays out of the way until it is a strong match.
-   **Pinned**: protected from decay and forced to stay hot.

Cooling does not delete a memory by itself. It only changes how strongly the entry competes during retrieval. A cold memory that turns out to be relevant is pulled back and becomes hot again.

![Diagram showing how a memory moves between hot and cold states](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiEBCTJVYXeaxpi2W1v7t9cGgFBwm4kZdrEyDTJ)

Heat changes over time. Retrieved memories stay fresh, neglected ones cool down, and cold memories can become hot again when they are found later.

### How retrieval actually works

Retrieval is more selective than older descriptions of the system suggest. It runs in five steps:

1.  The app embeds the search query locally. With context enrichment enabled, that query can include more than just the latest message.
2.  It ranks your stored memories by relevance. Hot memories are searched at full strength and cold memories are included too, just weighted lower so they only surface when they are a strong match.
3.  In smart mode, if there are still open slots after ranking, it backfills them with a recent memory or a frequently used one to keep recall stable.
4.  As a last resort, if the relevance search comes up empty, it runs a plain keyword scan as a safety net.
5.  Any cold memory that is retrieved becomes hot again.

![Diagram showing the retrieval decision path](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiEhN9IRQYy8QKsz1bUcRixZ7tHBO4qoCruLYPV)

The retrieval pass starts with a semantic ranking in which hot memories compete at full strength and cold memories at reduced weight. It only falls back to keyword matching when the semantic pass comes up empty.

Important

Dynamic Memory ranks your memories by relevance and only injects a small top slice into the prompt, never the whole database. The hot/cold split decides how strongly each memory competes, which keeps recall focused and the prompt small.
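
The five retrieval steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's implementation: `COLD_WEIGHT`, `MIN_SCORE`, the `Memory` fields, and the exact ordering of backfill versus keyword fallback are all hypothetical, and the query vector is passed in rather than computed by a real embedding model.

```python
import math
from dataclasses import dataclass

COLD_WEIGHT = 0.6  # assumed down-weighting applied to cold memories
MIN_SCORE = 0.3    # assumed relevance floor for the semantic pass


@dataclass
class Memory:
    text: str
    embedding: list[float]
    cold: bool = False
    access_count: int = 0
    last_used_turn: int = 0


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_text, query_vec, memories, turn, limit=3, smart=True):
    # Step 2: rank all memories; cold ones compete at reduced weight.
    scored = [(cosine(query_vec, m.embedding) * (COLD_WEIGHT if m.cold else 1.0), m)
              for m in memories]
    ranked = [m for s, m in sorted(scored, key=lambda p: p[0], reverse=True)
              if s >= MIN_SCORE]
    picked = ranked[:limit]

    # Step 3 (smart mode only): backfill open slots with the most recent,
    # then the most frequently used, hot memory that was not already picked.
    if smart and picked and len(picked) < limit:
        spare = [m for m in memories
                 if not m.cold and all(m is not p for p in picked)]
        recent = sorted(spare, key=lambda m: m.last_used_turn, reverse=True)[:1]
        frequent = sorted(spare, key=lambda m: m.access_count, reverse=True)[:1]
        for m in recent + frequent:
            if len(picked) < limit and all(m is not p for p in picked):
                picked.append(m)

    # Step 4: plain keyword scan if the relevance search found nothing.
    if not picked:
        words = set(query_text.lower().split())
        picked = [m for m in memories
                  if words & set(m.text.lower().split())][:limit]

    # Step 5: mark results as accessed; any cold hit becomes hot again.
    for m in picked:
        m.access_count += 1
        m.last_used_turn = turn
        m.cold = False
    return picked
```

Calling `retrieve(..., smart=False)` skips the backfill step, which corresponds to the pure-similarity behavior described for cosine mode.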

### Time-aware retrieval (companion mode)

When a companion chat has time awareness enabled, retrieval gets an extra step before scoring. The runtime parses temporal phrases out of the search query (things like "yesterday", "last week", "this month", "two fridays ago", "five weeks ago today", "in the last three days") and resolves them to a local date range.

-   If a range is detected, retrieval first narrows candidates to memories whose `observed_at` timestamp falls inside that range, then ranks them. If nothing falls in the range, retrieval returns empty rather than padding with off-topic entries.
-   When time awareness is on, new memories created during a turn are stamped with an `observed_at` taken from the latest message, which is what makes temporal filtering possible later. Without it, memories carry no event timestamp.
-   The ranker also adds a small lexical-overlap boost on top of cosine similarity, so concrete anchors in the query (names, places) reliably surface memories that mention the same things.
-   Memories that have an `observed_at` are rendered into the prompt with a short "observed YYYY-MM-DD HH:MM TZ" suffix.

Time-aware retrieval is companion-only today. Non-companion chats and roleplay flows skip the temporal phrase parser and run plain semantic retrieval. See **Companion Mode** for the per-session toggle and the time placeholder list.
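
A minimal sketch of the temporal filter, assuming a tiny phrase parser that only understands "yesterday", "last week", and "in the last N days" with N written as digits. The real parser handles many more phrases. The function names and the parenthesized suffix format are assumptions; only the `observed_at` field name comes from the description above.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Memory:
    text: str
    observed_at: Optional[datetime] = None


def resolve_range(query: str, now: datetime):
    """Return a (start, end) range for a temporal phrase, or None."""
    q = query.lower()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if "yesterday" in q:
        return midnight - timedelta(days=1), midnight
    match = re.search(r"in the last (\d+) days?", q)
    if match:
        return now - timedelta(days=int(match.group(1))), now
    if "last week" in q:
        monday = midnight - timedelta(days=midnight.weekday())
        return monday - timedelta(days=7), monday
    return None


def candidates(memories, query, now):
    """Narrow memories to the detected range before ranking."""
    window = resolve_range(query, now)
    if window is None:
        return list(memories)  # no temporal phrase: rank everything
    start, end = window
    # An empty result is returned as-is rather than padded.
    return [m for m in memories
            if m.observed_at is not None and start <= m.observed_at < end]


def render(memory: Memory) -> str:
    """Append the observed-at suffix when a timestamp exists."""
    if memory.observed_at is None:
        return memory.text
    stamp = memory.observed_at.strftime("%Y-%m-%d %H:%M %Z")
    return f"{memory.text} (observed {stamp})"
```

Memories without an `observed_at` value never match a date range in this sketch, which mirrors the note that temporal filtering depends on the timestamp being stamped at creation.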

### Background maintenance cycle

Retrieval happens every turn. Memory maintenance does not. Instead, the app waits until enough new user or assistant messages have accumulated since the last processed window.

1.  After the configured message interval, LettuceAI selects the next unsummarized conversation window.
2.  A summarisation model generates a merged summary for that new window.
3.  A tool-driven pass can **create**, **delete**, **pin**, or **unpin** memory entries.
4.  The app stores the updated summary, the new memory set, and the event cursor so the next cycle continues from the right point.
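
The cycle above can be sketched as a single function. This is an illustration only: the state layout, the action dictionaries, and the `summarize` and `plan_actions` callables (stand-ins for the summarisation model and the tool-driven pass) are hypothetical, and the sketch assumes the window is every message since the cursor.

```python
def apply_action(memories: dict, action: dict) -> None:
    """Apply one create/delete/pin/unpin action to an id -> entry mapping."""
    op = action["op"]
    if op == "create":
        new_id = max(memories, default=0) + 1
        memories[new_id] = {"text": action["text"], "pinned": False}
    elif op == "delete":
        memories.pop(action["id"], None)
    elif op in ("pin", "unpin"):
        if action["id"] in memories:
            memories[action["id"]]["pinned"] = (op == "pin")


def run_cycle(state, messages, interval, summarize, plan_actions) -> bool:
    """Run one maintenance cycle if enough new messages have accumulated."""
    window = messages[state["cursor"]:]          # step 1: unsummarized window
    if len(window) < interval:
        return False                             # not enough new messages yet
    state["summary"] = summarize(state["summary"], window)       # step 2
    for action in plan_actions(state["summary"], window, state["memories"]):
        apply_action(state["memories"], action)                  # step 3
    state["cursor"] = len(messages)              # step 4: advance the cursor
    return True
```

Because the cursor only advances after the summary and memory changes are stored, a cycle that never runs leaves the same messages waiting for the next one.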

You do not have to wait for the interval. You can also run a maintenance pass on demand, or have the app check with you first. See **When memory updates run** below.

Structured fallback format

Some models do not reliably emit tool calls during the memory pass. When that happens, LettuceAI falls back to a structured text format (XML by default, JSON optional) so it can still extract memory create, delete, pin, and unpin actions from the model output.
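
To show the general idea of a structured text fallback, here is a small parser for an invented XML layout. The `<memory_actions>` wrapper, the child tag names, and the `id` attribute are assumptions made for this example; the format LettuceAI actually expects is not documented here.

```python
import re
import xml.etree.ElementTree as ET


def parse_actions(model_output: str) -> list[dict]:
    """Extract memory actions from free-form model text (hypothetical schema)."""
    # Models often wrap the block in chatter, so isolate it first.
    match = re.search(r"<memory_actions>.*?</memory_actions>", model_output, re.DOTALL)
    if not match:
        return []
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        return []  # malformed block: skip instead of guessing
    actions = []
    for node in root:
        if node.tag == "create":
            actions.append({"op": "create", "text": (node.text or "").strip()})
        elif node.tag in ("delete", "pin", "unpin"):
            actions.append({"op": node.tag, "id": node.get("id")})
    return actions
```

Returning an empty list on malformed output keeps a bad generation from corrupting the memory set.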

![Diagram showing the Dynamic Memory maintenance cycle](https://lhdgeo5fms.ufs.sh/f/m0TBUtMLsaiELhtqdUum4E1s7fKcbIrtgJlMzOkUhDv6GqQX)

The background cycle is separate from the live chat turn: it summarizes a new window, runs memory tools, and advances the stored cursor.

## When memory updates run

By default the maintenance cycle runs on its own, but you can decide how much say you get. Dynamic Memory has three run modes, set separately for direct chats and group chats in **Settings → Dynamic Memory** under **Memory Updates**.

| Run mode | What it does |
| --- | --- |
| Automatic | The default. Memory updates run on their own in the background once enough new messages have built up. |
| Ask First | The app asks before each update so you can start it now or skip it and roll those messages into the next one. |
| Manual | Updates never run on their own. They only happen when you trigger them from the memory panel. |

Per chat type

Direct chats and group chats each have their own run mode. Changing one does not change the other.

### The Ask First approval prompt

In Ask First mode, once a full batch of new messages is ready, a small sheet appears titled **Update Memory?** It tells you how many new messages are waiting and gives you two choices:

-   **Start Now**: run the memory update immediately.
-   **Skip This Cycle**: do nothing for now. The waiting messages are not lost, they simply roll into the next update.

The prompt appears only once per batch, so it will not nag you on every message. If you close the chat with a prompt still pending, it reappears the next time you open that chat.
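
The three run modes boil down to a small decision made when a batch of new messages is ready. The sketch below assumes hypothetical names (`RunMode`, `settings`, `on_batch_ready`) and does not model the prompt reappearing when a chat is reopened.

```python
from enum import Enum


class RunMode(Enum):
    AUTOMATIC = "automatic"
    ASK_FIRST = "ask_first"
    MANUAL = "manual"


# Direct and group chats each carry their own run mode.
settings = {"direct": RunMode.AUTOMATIC, "group": RunMode.ASK_FIRST}


def on_batch_ready(chat_type: str, batch_id: int, prompted: set) -> str:
    """Decide what to do when a full batch of new messages is waiting."""
    mode = settings[chat_type]
    if mode is RunMode.AUTOMATIC:
        return "run"                 # update runs in the background
    if mode is RunMode.ASK_FIRST and batch_id not in prompted:
        prompted.add(batch_id)       # ask only once per batch
        return "prompt"
    return "wait"                    # Manual mode, or already asked
```

In Manual mode the function always returns `"wait"`, matching the rule that updates only happen when triggered from the memory panel.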

### Running a cycle yourself

In any mode you can run a memory update by hand from the memory panel. This is the only way updates happen in Manual mode, and it is handy in the other modes when you want to refresh memory right now.

-   Open the chat's memory panel and use the **Run** button (in companion chats this is labeled **Process memory**).
-   A progress bar shows the four stages of the cycle: summarizing the conversation, analyzing memories, applying changes, and organizing memories. A live token counter and speed readout appear while the model works, plus a warning if the model goes quiet.
-   The same button becomes **Cancel** while a cycle is running, so you can stop it at any time.
-   With developer mode on, a **View live output** button lets you watch the model's raw text stream in as the memory is generated.

### Editing the context summary

The context summary is the short recap Dynamic Memory keeps to hold a long conversation together. You do not have to leave it entirely to the app. In the memory panel, tap the **Context Summary** card to open an editor, rewrite the recap in your own words, and save. This works for both regular dynamic chats and companion chats.

### Cleaner memory generation

When a local (on-device) model writes your memories, it can get stuck repeating the same phrase. Dynamic Memory ships with a built-in, loop-resistant sampler configuration that discourages this repetition so memory updates stay clean. It is on by default. If you would rather use your model's own sampler settings, turn off **Overwrite Sampler Configuration** in the advanced Dynamic Memory settings.

### Custom summarizer and memory-manager prompts

Advanced users can replace the two prompts that drive the maintenance cycle. Both live in **Settings → Dynamic Memory** under **Summarisation**, and both default to **Use built-in default**.

-   **Summary Prompt**: controls how recent conversation turns are condensed into the context summary.
-   **Memory Manager Prompt**: controls how memories are added, updated, and deleted, for both direct and group chats.

You write the actual prompt text as a template in **Settings → System Prompts** (using the Dynamic Memory Summarizer and Dynamic Memory Manager template types), then choose it from these dropdowns.

### Decay, budgets, and deletion

Dynamic Memory uses multiple controls, not just one:

-   **Decay rate**: lowers the importance of hot, unpinned memories across maintenance cycles.
-   **Cold threshold**: entries below this threshold move to cold storage.
-   **Hot memory token budget**: if too many hot memories are active, older unpinned ones are demoted to cold.
-   **Max entries**: if the memory set grows too large, lower-priority unpinned memories can be trimmed entirely, chosen by a mix of importance and recency rather than age alone.

That last point matters: Dynamic Memory can delete or soft-delete entries during maintenance. Cooling alone is not deletion, but the full system is allowed to remove memories when it decides they are stale, wrong, or lower priority than newer ones.
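
Here is one way the four controls could interact during a maintenance pass. Treat it as a sketch under assumptions: multiplicative decay, the `0.05` recency penalty in the trimming priority, and all field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    importance: float
    tokens: int
    last_used_cycle: int
    pinned: bool = False
    cold: bool = False


def maintain(entries, cycle, decay_rate, cold_threshold,
             hot_token_budget, max_entries):
    # Decay rate + cold threshold: cool hot, unpinned entries.
    for e in entries:
        if not e.pinned and not e.cold:
            e.importance *= (1.0 - decay_rate)
            if e.importance < cold_threshold:
                e.cold = True

    # Hot memory token budget: demote the oldest unpinned hot entries.
    hot = [e for e in entries if not e.cold]
    used = sum(e.tokens for e in hot)
    for e in sorted((e for e in hot if not e.pinned),
                    key=lambda e: e.last_used_cycle):
        if used <= hot_token_budget:
            break
        e.cold = True
        used -= e.tokens

    # Max entries: trim the lowest-priority unpinned entries, scored by a
    # mix of importance and recency rather than age alone.
    def priority(e):
        return e.importance - 0.05 * (cycle - e.last_used_cycle)

    excess = max(0, len(entries) - max_entries)
    removable = sorted((e for e in entries if not e.pinned), key=priority)
    doomed = {id(e) for e in removable[:excess]}
    return [e for e in entries if id(e) not in doomed]
```

Pinned entries are skipped by all three stages, so they neither decay, get demoted, nor get trimmed.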

## Embedding models

Dynamic Memory depends on a local embedding model so the app can compare meaning, not just exact wording.

**v4** is the current default. Older **v1**, **v2**, and **v3** installs may still exist on some devices, but they are now legacy variants.

| Version | Status | Context support | Notes |
| --- | --- | --- | --- |
| v1  | Legacy | 512 tokens | Oldest model. Limited long-context support. |
| v2  | Legacy | Up to 4096 tokens | Major improvement over v1. |
| v3  | Legacy | Up to 4096 tokens | Previous default. Still works, but v4 is preferred for new installs. |
| v4  | Current default | Up to 4096 tokens | 768-dimension model. Recommended for new installs and upgrades. |

On v3 and v4, you can set the local embedding capacity anywhere between 512 and 4096 tokens depending on your memory quality and performance preferences.

### Downloading the embedding model

Embedding models run on your device, so they have to be downloaded before Dynamic Memory can work. The embedding download page walks you through this in two steps:

1.  **Pick a capacity**: choose how many tokens of context the local model should support per memory.
    -   **1K tokens**: smallest and fastest. Best for quick responses and lower-end devices.
    -   **2K tokens**: balanced default. Good quality without much extra cost.
    -   **4K tokens**: maximum context. Best memory quality, uses more memory and CPU per embed.
2.  **Download**: the v4 model files (about 138 MB) are fetched in the background. Progress is shown live, and you can cancel or retry if the download fails.

After the download finishes, the app runs a short self-test to confirm embeddings work end to end before enabling Dynamic Memory. You can upgrade from an older model version (v1, v2, v3) using the same flow.

Switching capacity later

Capacity is chosen per install. If you want to change it later, you can re-run the download flow and pick a different size.

### Upgrading or changing the model

Every memory is stored as plain text alongside its meaning fingerprint, so the text is always the source of truth. If you switch the embedding model version or change its dimensions, those fingerprints have to be rebuilt from the saved text so old and new memories stay comparable.

-   Re-embedding happens quietly the next time each chat uses memory, not all at once. A small **Migrating memory vectors** notice shows progress, and replies may be delayed briefly while it runs.
-   Progress is saved as it goes. If the app closes or one memory fails to rebuild, the next run picks up where it left off instead of starting over, so it does not get stuck redoing the same work.
-   Memories left behind by deleted chats, groups, or characters are cleaned up automatically so they do not pile up.
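
Resumable re-embedding can be pictured as a loop that tags each memory with the model version that produced its vector. The dictionary keys and the `embed` callable below are hypothetical; the point is only that saved text is re-embedded and that finished work is never repeated.

```python
def migrate(memories: list[dict], embed, target_version: str) -> int:
    """Re-embed stale memories from their saved text.

    Returns how many memories still need migration after this run.
    """
    for m in memories:
        if m["model_version"] == target_version:
            continue  # already migrated in an earlier run: skip
        try:
            m["embedding"] = embed(m["text"])   # text is the source of truth
            m["model_version"] = target_version  # progress saved per memory
        except Exception:
            continue  # leave this one stale; the next run retries only it
    return sum(1 for m in memories if m["model_version"] != target_version)
```

A failed memory stays marked with its old version, so a later run retries just that entry while leaving the completed ones alone.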

### Context enrichment

Context enrichment improves retrieval by embedding a short contextual slice instead of only the latest message. That helps when the newest user line contains pronouns, callbacks, or implied references.

-   Better follow-up recall
-   Better disambiguation of names and pronouns
-   Better stability when the conversation jumps topics quickly
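
A minimal sketch of what "embedding a short contextual slice" might look like. The window size, character cap, and `role: text` formatting are assumptions chosen for the example.

```python
def build_query(messages: list[dict], enrich: bool = True,
                window: int = 3, max_chars: int = 600) -> str:
    """Build the text that gets embedded for memory retrieval."""
    if not enrich:
        return messages[-1]["text"]  # latest message only
    recent = messages[-window:]
    text = "\n".join(f'{m["role"]}: {m["text"]}' for m in recent)
    return text[-max_chars:]  # keep the most recent part if too long
```

With enrichment on, a follow-up like "She loves it there." is embedded together with the earlier line that names who "she" is and where "there" is.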

Local-first

Embeddings are generated and stored locally. Providers only see memory text if that memory is actually injected into the prompt for a turn.

## Token usage and cost

Dynamic Memory usually reduces prompt growth, but it is not free.

-   Retrieved memories still cost tokens when they are injected into a prompt.
-   The summarisation and memory-tool passes also consume tokens on the summarisation model you selected.
-   Manual Memory has the simplest behavior, but its per-turn prompt cost rises linearly with the size of the memory list, because the whole list is sent every turn.

The reason Dynamic Memory is usually cheaper is not that it costs nothing. It is cheaper because it sends a small relevant subset most of the time instead of replaying everything.
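
A back-of-the-envelope comparison makes the trade-off concrete. All numbers below are invented for illustration (40 memories of about 30 tokens each, 5 injected per turn, 10 maintenance cycles at 2,000 tokens each); real costs depend on your settings and models.

```python
def manual_tokens(memory_sizes: list[int], turns: int) -> int:
    # The whole list is sent on every turn.
    return sum(memory_sizes) * turns


def dynamic_tokens(avg_memory_size: int, injected_per_turn: int, turns: int,
                   cycles: int, tokens_per_cycle: int) -> int:
    # Only a small retrieved subset is sent per turn, plus maintenance passes.
    per_turn = avg_memory_size * injected_per_turn
    return per_turn * turns + cycles * tokens_per_cycle


manual = manual_tokens([30] * 40, turns=100)
dynamic = dynamic_tokens(30, 5, turns=100, cycles=10, tokens_per_cycle=2000)
print(manual, dynamic)  # 120000 35000
```

With a very small memory list the comparison can flip, since the maintenance passes are a fixed overhead that Manual Memory never pays.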

## Managing memories

You still keep control, even with Dynamic Memory.

-   With Manual Memory, you add, edit, and delete the fixed list yourself.
-   With Dynamic Memory, you can review entries, pin important ones, delete incorrect ones, and manually add missing facts.

## Which mode should you use?

-   **Use Manual Memory** if you want a short, stable, always visible rule list.
-   **Use Dynamic Memory** if you want better continuity across long conversations without letting prompt size grow forever.

