Tutorial: Reusing Multiple Prompt Prefixes with slots (`-np`) in llama-server
#15530
aayush226 started this conversation in Show and tell
Posting this here as part of contributing back to the docs. This tutorial builds on the original question from #13488 about caching multiple prompt prefixes and picks up the tutorials TODO from #13523. Thought it would be useful to share this in Show and tell so others can refer to it too. Feedback welcome!
Motivation
Keep several different prompt prefixes hot at once, so when we rotate between them we still get fast Time To First Token (TTFT).
What "hot" means here: let's start with what a prefix is. A prefix is the shared beginning of your prompt; often, this is your system message. When a model processes a prefix once, it can store the resulting KV cache. The model can then skip recomputing this KV cache when you reuse the exact same prefix (same text, whitespace, and punctuation), which reduces the model's TTFT significantly. `llama-server` can keep multiple such prefixes cached (hot) at once using slots, with the help of `-np`.

How to run a server with multiple slots?
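A minimal launch sketch (the model path, port, and sizes below are placeholders of my choosing, not from the original post):

```shell
# Start llama-server with 2 parallel slots.
# -np 2   -> two cache slots, one per prefix we want to keep hot
# -c 8192 -> total context window; it is shared across slots,
#            so each slot gets roughly 8192 / 2 = 4096 tokens
./llama-server -m ./models/model.gguf -np 2 -c 8192 --port 8080
```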
What each term means here:
- `-np N`: number of parallel slots; each slot holds its own KV cache.
- `-c`: the context window, shared across all slots.
- hot: a prefix whose KV cache is already stored in a slot, ready for reuse.
Let's see this in action. Warm 2 prefixes with `cache_prompt: true` so that they are stored in slots.

Prefix A:
Prefix B:
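As a sketch, the two warm-up requests for Prefix A and Prefix B could look like this. The prompt texts, port, and the use of `id_slot` to pin each prefix to a slot are illustrative assumptions, not taken from the original post:

```shell
# Prefix A: pinned to slot 0 (id_slot is optional; without it the
# server picks the slot with the best cached-prefix match).
curl http://localhost:8080/completion -d '{
  "prompt": "You are a meticulous legal assistant. Question: What is a tort?",
  "cache_prompt": true,
  "id_slot": 0,
  "n_predict": 32
}'

# Prefix B: a different system message, pinned to slot 1.
curl http://localhost:8080/completion -d '{
  "prompt": "You are a terse coding assistant. Question: What is a mutex?",
  "cache_prompt": true,
  "id_slot": 1,
  "n_predict": 32
}'
```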
Important: To guarantee a cache hit, the prefix must be exactly identical (same text, spaces, punctuation and role).
Reuse a prefix with a new suffix. Let's reuse Prefix A:
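A sketch of the reuse request, wrapped in `time` to eyeball the speedup (the prompt text, port, and `id_slot` are illustrative assumptions):

```shell
# Same Prefix A text, byte-for-byte, followed by a new suffix.
# Only the suffix tokens need prompt processing; the prefix's
# KV cache is reused from slot 0, so this request starts faster.
time curl http://localhost:8080/completion -d '{
  "prompt": "You are a meticulous legal assistant. Question: What is negligence?",
  "cache_prompt": true,
  "id_slot": 0,
  "n_predict": 32
}'
```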
The second run of Prefix A will start faster than the first, with a lower TTFT. If we ran the server with more slots (N > 2 in `-np N`), did the same thing for other prefixes, and then rotated back to A, the TTFT for A would still be low because A's KV cache remained hot in its own slot.
Why increase `-c` as `-np` grows?
All slots share the same context window `-c`. If you initialize the server with a high N in `-np N` but `-c` is too small, long prefixes will not fit and the KV cache may be evicted or truncated. So it is necessary to increase the context window `-c` when N in `-np N` is increased.

One important clarification: `-np` does not make the CPU/GPU run faster. It only provides multiple cache slots so that more prefixes can stay hot. This improves TTFT only; tokens/second remains the same. `-np` still helps even if you are running on 1 CPU, but tokens/sec stays unchanged.

Summary:
- Use `cache_prompt: true` to cache multiple prefixes.
- Use `-np N` to keep N prefixes cached simultaneously.
- Don't forget to increase `-c` when increasing `-np` so each slot can hold its prefix.
- Expect lower TTFT on revisits with `-np`, but remember tokens/sec stays the same.