Feature Request
I’m using LLM.swift for on-device inference in a macOS app. I noticed that there is currently no exposed way to reset the model context or KV cache between runs without reinitializing the entire model. This becomes an issue in apps that perform multiple inferences in sequence (e.g., analyzing transcript segments or handling multiple short prompts in a session).
Problem
Calling .predict(...) multiple times on the same LLMModel instance can cause:
- Errors like:
  - decode: failed to find KV cache slot for ubatch of size 512
  - llama_decode: failed to decode, ret = 1
- Memory buildup or inference inconsistencies over time.
- The only workaround, reinitializing the model before every run, is inefficient and causes noticeable performance degradation, especially for short prompts (see the sketch below).
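To make the scenario concrete, here is a minimal repro-style sketch of the sequential-inference pattern that triggers the errors above. The LLMModel(from:) initializer and predict(_:) method follow the naming used in this report and are illustrative, not verified LLM.swift API.

import Foundation

// Illustrative only: LLMModel(from:) and predict(_:) mirror the names used
// in this report, not necessarily the actual LLM.swift surface.
let modelURL = URL(fileURLWithPath: "/path/to/model.gguf")
let segments = ["transcript chunk 1", "transcript chunk 2", "transcript chunk 3"]

let model = try LLMModel(from: modelURL)
for segment in segments {
    // Every call appends to the same context; once the KV cache is full,
    // decoding fails ("failed to find KV cache slot for ubatch of size 512").
    let summary = try await model.predict("Summarize: \(segment)")
    print(summary)
}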
Proposed Solution
Expose a method on LLMModel like this:
public func resetContext() {
    // Clear the KV cache without reloading the weights; the exact llama.cpp
    // symbol (e.g. llama_kv_cache_clear) depends on the bundled version.
    llama_kv_cache_clear(context)
}
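For context, a sketch of how such a method would be used between independent prompts, again assuming the hypothetical LLMModel / predict(_:) names from above:

let model = try LLMModel(from: modelURL)

let first = try await model.predict("Analyze segment A")
model.resetContext()   // clear the KV cache instead of reinitializing the model

let second = try await model.predict("Analyze segment B")

This keeps the weights loaded while discarding the per-run context state, which is the cheap part to rebuild compared to reinitializing the whole model.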