---
url: https://lettuceai.app/docs/models
title: "Models — LettuceAI"
description: "Learn how AI models work and how to configure temperature, top-p, top-k, and other generation parameters in LettuceAI."
---


# Models

Models are the AI systems that generate the responses you see when you chat. When you type a message, a model reads the request, looks at the context it was given, and writes the reply.

These are sometimes called **LLMs**. In LettuceAI we just call them **models**.

## What is a model?

A model is the part that actually produces text. When you send a message, your provider runs the model, and the model sends the answer back to LettuceAI.

Different models behave differently. Some are better at creative writing, some at reasoning, some are faster, and some are cheaper.

You do not need to understand the full math. In practice, choosing a model mostly means choosing the quality, tone, speed, and price you want.

## What is an LLM?

LLM stands for **Large Language Model**. That is just the technical name for this kind of text-generating AI.

## Why choose different models?

Different models can be better for different tasks. For example:

-   some are faster
-   some are more detailed
-   some are cheaper to use
-   some have a unique tone or style

There is no single best model for everyone. You can switch depending on what kind of conversation you want.

## Do I need to configure anything?

Not necessarily. If you are unsure which model to use, the default options work fine for most conversations. You can change models later without losing chats or memory.

## Adding a model

Models live under a provider. Once you have added a provider in **Settings → Providers**, open that provider and add the models you want to use. You give each model the model name your provider expects (for example a name from their model list) and an optional display name to make it easier to recognize.

You can mark one model as the **default**. The default is used for new chats unless a character specifies its own model. Each character can have its own preferred model, which is set on the character instead of per chat.
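
The fallback rule above can be pictured as a tiny lookup. This is only an illustrative sketch, not LettuceAI's actual code: the `Character` class, the `model_for_new_chat` function, and the model names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Character:
    # Hypothetical stand-in for a character; not LettuceAI's real data model.
    name: str
    preferred_model: Optional[str] = None  # None means "no preference set"


def model_for_new_chat(character: Character, default_model: str) -> str:
    """Use the character's own model if it has one, otherwise the default."""
    return character.preferred_model or default_model


default_model = "example-default-model"  # made-up model name
narrator = Character("Narrator")  # no preferred model
detective = Character("Detective", preferred_model="example-reasoning-model")

print(model_for_new_chat(narrator, default_model))
print(model_for_new_chat(detective, default_model))
```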

**Switching models anytime:** You can change the model for a conversation at any time without losing the chat or its memory. Adding many models is fine. A model only costs anything when you actually send a message with it.

## The model editor

Each model has its own settings page, organized into tabs. You do not have to touch any of these. The defaults work for most people.

-   **Generation:** the core settings shared by most models, like temperature, Top-P, Top-K, max output tokens, and the penalties. These are explained below.
-   **Runtime:** extra settings for local models (the built-in engine, Ollama, and similar), such as how much work goes to your GPU, thread and batch sizes, and advanced samplers. Cloud providers do not show this tab.
-   **Reasoning:** controls for models that can think before answering. See the Reasoning Mode section below.
-   **Caching:** prompt caching for providers that support it, which can lower cost on long conversations.
-   **Capabilities:** a read-only summary of what the model supports.

## Model Parameters

Some models expose extra settings that change how they generate replies. Most users do not need to touch them, but it helps to know what they mean.

### How generation works, simplified

The model does not write the whole answer in one go. It generates text one step at a time.

At each step, it looks at many possible next words or tokens and asks "which one is most likely here?"

Example: if the text so far is `The cat sat on the`, the model will usually think words like `mat`, `floor`, or `chair` make more sense than something random like `galaxy`.

Then it has to pick one. Temperature, Top-K, and Top-P change how safe or adventurous that pick can be.
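
To make the "pick one" step concrete, here is a minimal sketch of a weighted random pick. The probabilities are made up for illustration, and `pick_next_word` is a hypothetical helper; real models work with tens of thousands of tokens instead of four words.

```python
import random
from collections import Counter

# Made-up probabilities for the next word after "The cat sat on the".
next_word_probs = {"mat": 0.50, "floor": 0.30, "chair": 0.19, "galaxy": 0.01}


def pick_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(pick_next_word(next_word_probs, rng) for _ in range(10_000))

# Likely words win most of the time, but unlikely ones still show up now and then.
print(counts.most_common())
```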

**Important:** These settings do not make the model smarter. They only change how it chooses between possible next words.

### Temperature

Temperature is the main setting for how **predictable vs random** the output feels.

-   **Lower temperature:** safer, more stable, more repetitive
-   **Higher temperature:** more creative, looser, more chaotic

Example prompt: `The wizard opened the ancient`

With **lower temperature**, the model is more likely to continue with something obvious like `door`, `book`, or `chest`.

With **higher temperature**, it is more willing to choose something less expected like `gateway`, `tomb`, or `void`.

Lower temperature usually feels more consistent. Higher temperature can feel more vivid, but also more weird or off-track.
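
Under the hood, temperature is usually applied by dividing the model's raw scores by the temperature before turning them into probabilities (a softmax). The sketch below assumes that common approach. The scores are invented, and `softmax_with_temperature` is a hypothetical helper, not a LettuceAI function.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [s / temperature for s in scores.values()]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return {word: e / total for word, e in zip(scores, exps)}


# Made-up raw scores for the word after "The wizard opened the ancient".
scores = {"door": 3.0, "book": 2.5, "chest": 2.0,
          "gateway": 1.0, "tomb": 0.5, "void": 0.0}

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(scores, t)
    print(t, {w: round(p, 3) for w, p in probs.items()})
```

At the lower temperature, `door` takes a larger share of the probability. At the higher temperature, the share is spread more evenly, so `void` gets picked more often.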

**Beginner advice:** If you only want one creativity setting, use Temperature and leave Top-K and Top-P alone.

### Top-K (if supported)

Top-K is a **hard limit** on how many next-word options the model is allowed to consider.

Example: if `Top-K = 5`, the model keeps only the 5 most likely next options and ignores everything else.

| Rank | Possible next word | Kept? |
| --- | --- | --- |
| 1   | door | Yes |
| 2   | book | Yes |
| 3   | chest | Yes |
| 4   | hall | Yes |
| 5   | room | Yes |
| 6+  | void / galaxy / thunder | No  |

So if the best options are `door`, `book`, `chest`, `hall`, and `room`, the model must choose from those. A weirder option that ranked much lower is not even allowed into the final choice pool.

Top-K is useful when you want a **strict cap** on how wide the model's choice pool can be.

Not all providers support this setting.
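
As a rough sketch, Top-K filtering can be written in a few lines. The probabilities here are made up to match the table above, and `top_k_filter` is a hypothetical name used only for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


# Made-up probabilities for the next word.
probs = {"door": 0.30, "book": 0.20, "chest": 0.15, "hall": 0.10,
         "room": 0.08, "void": 0.07, "galaxy": 0.06, "thunder": 0.04}

pool = top_k_filter(probs, k=5)
print(list(pool))  # ['door', 'book', 'chest', 'hall', 'room']
```

The final pick is then made only from `pool`, so `void`, `galaxy`, and `thunder` can never be chosen at this step.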

### Top-P

Top-P, also called **nucleus sampling**, is similar to Top-K but more flexible.

Instead of saying "keep exactly 5 options", Top-P says "keep enough of the likely options to cover most of the probability."

Example: if `Top-P = 0.90`, the model keeps adding likely next words until the combined likelihood reaches about 90%. Some moments might only need a few obvious options. Other moments might need many more.

| Word | Chance | Running total |
| --- | --- | --- |
| door | 30% | 30% |
| book | 20% | 50% |
| chest | 15% | 65% |
| hall | 10% | 75% |
| room | 8%  | 83% |
| gate | 4%  | 87% |
| tomb | 3%  | 90% |

In that example, the model stops at `tomb` because the total has reached 90%. Lower-ranked words after that are ignored.
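
The running-total idea from the table can be sketched like this. The percentages for `door` through `tomb` come from the table. `void` and `galaxy` are made-up extras standing in for the lower-ranked words, and `top_p_filter` is a hypothetical helper. Whole-number percentages are used to keep the arithmetic exact.

```python
def top_p_filter(percents, top_p_percent):
    """Keep the most likely words until their running total reaches top_p_percent."""
    ranked = sorted(percents.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0
    for word, pct in ranked:
        kept.append(word)
        running += pct
        if running >= top_p_percent:
            break  # enough probability is covered; ignore the rest
    return kept


percents = {"door": 30, "book": 20, "chest": 15, "hall": 10,
            "room": 8, "gate": 4, "tomb": 3, "void": 2, "galaxy": 1}

print(top_p_filter(percents, 90))  # ['door', 'book', 'chest', 'hall', 'room', 'gate', 'tomb']
print(top_p_filter(percents, 50))  # ['door', 'book']
```

Notice that the number of surviving words changes with the setting and with how the probabilities are spread, which is what makes Top-P more flexible than a fixed Top-K.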

-   **Lower Top-P:** fewer options survive, so replies feel safer
-   **Higher Top-P:** more options survive, so replies feel looser

The short version is:

-   **Top-K:** how many options can survive?
-   **Top-P:** how much of the likely option space should survive?

In practice, Top-P and Temperature both affect randomness. Most users should only adjust one of them, not both at the same time.

If you are unsure, leave Top-P at its default value. The model will behave normally without any tuning.

### Max Output Tokens

This setting caps how long the model's reply can be, measured in tokens (small chunks of text, often a word or part of a word).

-   **Lower values:** shorter answers
-   **Higher values:** longer answers

Example: if you keep getting giant walls of text, lower this. If the AI keeps cutting itself off too early, raise it.

### Presence Penalty

Presence penalty pushes the model to bring in new ideas instead of staying on the same topic.

Example: if the AI keeps circling around the same scene or thought, raising presence penalty can make it introduce something new.

### Frequency Penalty

Frequency penalty tries to reduce repeated wording.

Example: if the model keeps repeating the same phrases, sentence shapes, or favorite words, a higher frequency penalty can help break that habit.
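
One commonly documented way to apply both penalties (used by OpenAI-style APIs) is to subtract from a word's score based on how often it has already appeared. Your provider may do it differently, so treat this as a sketch under that assumption. The scores, the word history, and the `apply_penalties` helper are made up for illustration.

```python
from collections import Counter


def apply_penalties(scores, generated_words, presence_penalty, frequency_penalty):
    """Lower the scores of words that already appeared in the reply."""
    counts = Counter(generated_words)
    adjusted = {}
    for word, score in scores.items():
        count = counts[word]
        adjusted[word] = (
            score
            - frequency_penalty * count  # grows with every repeat
            - presence_penalty * (1 if count > 0 else 0)  # flat, applied once a word has appeared
        )
    return adjusted


scores = {"forest": 2.0, "river": 1.5, "castle": 1.0}  # made-up raw scores
history = ["forest", "forest", "forest", "river"]  # words already written

adjusted = apply_penalties(scores, history,
                           presence_penalty=0.5, frequency_penalty=0.4)
print(adjusted)
```

In this sketch `forest` started as the favorite, but after three repeats it is penalized the most, so the unused word `castle` ends up with the highest score.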

### XTC (Exclude Top Choices)

XTC is an advanced sampler for **local (llama.cpp) models** that does the opposite of Top-K and Top-P. Instead of trimming the unlikely options, it occasionally removes the most **obvious** ones, nudging the model toward fresher, less predictable wording. It is great for creative writing and roleplay, and is turned **off by default**.

It has two settings:

-   **XTC Probability:** how often XTC activates. `0` means off (the default). Higher values like `0.5` apply it more often, adding more variety.
-   **XTC Threshold:** how likely a word must be to count as a "top choice" that can be dropped. The default is `0.1`. Lower is more aggressive; values above about `0.5` effectively turn it off.

A good starting point is `Probability 0.5` and `Threshold 0.1`. XTC always keeps at least one solid option, so replies stay coherent. It just avoids the predictable pick.

Leave XTC off for tasks that need precise, correct answers (code, math, factual questions), since it deliberately steers away from the most likely (and often correct) word.
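
Here is a simplified sketch of the idea, based on how the XTC sampler is commonly described: when it activates, every word at or above the threshold is dropped except the least likely of them. The probabilities are made up and `xtc_filter` is a hypothetical helper, not the actual llama.cpp code.

```python
import random


def xtc_filter(probs, probability, threshold, rng):
    """Sometimes drop every 'top choice' except the least likely one."""
    if rng.random() >= probability:
        return dict(probs)  # XTC did not activate on this step
    top = [w for w, p in probs.items() if p >= threshold]
    if len(top) < 2:
        return dict(probs)  # fewer than two top choices: nothing to drop
    keep = min(top, key=lambda w: probs[w])  # the one solid option that stays
    survivors = {w: p for w, p in probs.items() if w not in top or w == keep}
    total = sum(survivors.values())
    return {w: p / total for w, p in survivors.items()}


# Made-up probabilities for the next word.
probs = {"door": 0.45, "book": 0.25, "chest": 0.15, "gateway": 0.08, "void": 0.07}
rng = random.Random(0)

# Probability 1.0 forces XTC to activate so the effect is easy to see.
print(xtc_filter(probs, probability=1.0, threshold=0.1, rng=rng))
```

With a threshold of `0.1`, the words `door`, `book`, and `chest` all count as top choices, so `door` and `book` are removed and `chest` stays. With a threshold above `0.5`, at most one word can qualify, so nothing is ever dropped.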

## Reasoning Mode

Some models support a special reasoning mode. This lets the AI spend more effort on difficult tasks before replying.

It is mostly useful for coding, planning, analysis, and logic-heavy tasks. For normal conversation or roleplay, you usually do not need it.

Different providers expose reasoning in different ways. LettuceAI groups them into four types:

-   **Effort:** you choose how hard the model should think
-   **Budget-only:** you set a token limit for reasoning
-   **Dynamic:** the provider supports both effort and budget, but only one can be active at a time
-   **None:** reasoning controls are not available

When a model supports both effort and budget, enabling one means the other is ignored for that request.

**Models with built-in reasoning:** Some AI models have built-in reasoning behavior. In those cases, your settings may be ignored if the provider does not expose control over it.

