Tutorial: Reusing Multiple Prompt Prefixes with slots (`-np`) in llama-server
#15530
aayush226 started this conversation in Show and tell
Posting this here as part of contributing back to the docs. This tutorial builds on the original question from #13488 about caching multiple prompt prefixes and picks up the tutorials TODO from #13523. Thought it would be useful to share this in Show and tell so others can refer to it too. Feedback welcome!
Motivation
Keep several different prompt prefixes hot at once, so when we rotate between them we still get fast Time To First Token (TTFT).
What "hot" means here: let's start with what a prefix is. A prefix is the shared beginning of your prompt; often, this is your system message. When a model processes a prefix once, it can store the resulting KV cache. The model can then skip recomputing this KV cache when you reuse the exact same prefix (same text, whitespace, and punctuation), which reduces the model's TTFT significantly. `llama-server` can keep multiple such prefixes cached (hot) at once using slots, with the help of `-np`.

How to run a server with multiple slots?
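A minimal launch sketch (the model path, port, and sizes below are placeholders of my choosing, not from the original post):

```shell
# Start llama-server with 2 parallel slots.
# -np 2   -> two cache slots, one per prefix we want to keep hot
# -c 8192 -> total context window; it is shared across slots,
#            so each slot gets roughly 8192 / 2 = 4096 tokens
./llama-server -m ./models/model.gguf -np 2 -c 8192 --port 8080
```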
What each term means here:
- `-np N`: number of parallel slots; each slot holds its own KV cache.
- `-c`: the context window, shared across all slots.
- hot: a prefix whose KV cache is already stored in a slot, ready for reuse.
Let's see this in action. Warm 2 prefixes with `cache_prompt: true` so that they are stored in slots.

Prefix A:
Prefix B:
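As a sketch, the two warm-up requests for Prefix A and Prefix B could look like this. The prompt texts, port, and the use of `id_slot` to pin each prefix to a slot are illustrative assumptions, not taken from the original post:

```shell
# Prefix A: pinned to slot 0 (id_slot is optional; without it the
# server picks the slot with the best cached-prefix match).
curl http://localhost:8080/completion -d '{
  "prompt": "You are a meticulous legal assistant. Question: What is a tort?",
  "cache_prompt": true,
  "id_slot": 0,
  "n_predict": 32
}'

# Prefix B: a different system message, pinned to slot 1.
curl http://localhost:8080/completion -d '{
  "prompt": "You are a terse coding assistant. Question: What is a mutex?",
  "cache_prompt": true,
  "id_slot": 1,
  "n_predict": 32
}'
```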
Important: To guarantee a cache hit, the prefix must be exactly identical (same text, spaces, punctuation and role).
Reuse a prefix with a new suffix. Let's reuse Prefix A:
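A sketch of the reuse request, wrapped in `time` to eyeball the speedup (the prompt text, port, and `id_slot` are illustrative assumptions):

```shell
# Same Prefix A text, byte-for-byte, followed by a new suffix.
# Only the suffix tokens need prompt processing; the prefix's
# KV cache is reused from slot 0, so this request starts faster.
time curl http://localhost:8080/completion -d '{
  "prompt": "You are a meticulous legal assistant. Question: What is negligence?",
  "cache_prompt": true,
  "id_slot": 0,
  "n_predict": 32
}'
```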
The second run of Prefix A will start faster than the first, with a lower TTFT. If we ran the server with more slots (N > 2 in `-np N`), did the same thing for other prefixes, and then rotated back to A, the TTFT for A would still be low because A's KV cache remained hot in its own slot.
Why increase `-c` as `-np` grows?
All slots share the same context window `-c`. If you initialize the server with a high N in `-np N` but `-c` is too small, long prefixes will not fit and the KV cache may be evicted or truncated. So it is necessary to increase the context window `-c` when N in `-np N` is increased.

One important clarification: `-np` does not make the CPU/GPU run faster. It only provides multiple cache slots so that more prefixes can stay hot. This improves TTFT only; tokens/second remains the same. `-np` still helps even if you are running on 1 CPU, but tokens/sec stays unchanged.

Summary:
- Use `cache_prompt: true` to cache multiple prefixes.
- Use `-np N` to keep N prefixes cached simultaneously.
- Don't forget to increase `-c` when increasing `-np` so each slot can hold its prefix.
- Expect lower TTFT on revisits with `-np`, but remember tokens/sec stays the same.