Name and Version
```
G:\llama_cpp> llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from G:\llama_cpp\ggml-cuda.dll
load_backend: loaded RPC backend from G:\llama_cpp\ggml-rpc.dll
load_backend: loaded CPU backend from G:\llama_cpp\ggml-cpu-haswell.dll
version: 6387 (4fd1242)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
```
Operating systems
Windows
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 3060
Models
Voxtral-Mini-3B-2507-GGUF:Q4_K_M and mistralai_Voxtral-Small-24B-2507-GGUF
Problem description & steps to reproduce
Voxtral-Mini-3B-2507-GGUF works via llama-mtmd-cli but fails via llama-server.
This command works (taken from the tests):

```
llama-mtmd-cli -hf ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M --image test-2.mp3 -p "what is the publisher name of the newspaper?" --temp 0 -n 128
```

The equivalent API request to llama-server fails with `server.cpp:3562: GGML_ASSERT(batch.n_tokens > 0) failed`. Full log below.
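For reference, a sketch of the kind of request that triggers the failure. The original failing request isn't captured above, so the endpoint (`/v1/chat/completions` on the default port 8080), the `input_audio` payload shape, and the helper commands are assumptions, written as a Unix-style shell session:

```sh
# Reproduction sketch; payload shape, port, and file names are assumed,
# since the exact failing request isn't shown in this report.
llama-server -hf ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

# In another shell: base64-encode the same test file used with llama-mtmd-cli
# and send it as an OpenAI-style input_audio content part.
AUDIO_B64=$(base64 -w0 test-2.mp3)

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "temperature": 0,
  "max_tokens": 128,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "what is the publisher name of the newspaper?"},
      {"type": "input_audio", "input_audio": {"data": "$AUDIO_B64", "format": "mp3"}}
    ]
  }]
}
EOF
```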
A similar bug report: #13433
First Bad Commit
No response
Relevant log output
```
srv log_server_r: request: GET /props 192.168.1.1 200
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 190
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 3, n_tokens = 3, progress = 0.015789
slot update_slots: id 0 | task 0 | kv cache rm [3, end)
srv process_chun: processing audio...
encoding audio slice...
audio slice encoded in 5542 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 15 ms
srv process_chun: audio processed in 5560 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 190, n_tokens = 0, progress = 1.000000
D:/a/llama.cpp/llama.cpp/tools/server/server.cpp:3562: GGML_ASSERT(batch.n_tokens > 0) failed
```
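Note: in the last progress line, n_past reaches n_prompt_tokens (190) with n_tokens = 0, i.e. the audio chunk accounts for every remaining prompt token. The assertion then fires on what appears to be an empty text batch, which would explain why the same prompt succeeds via llama-mtmd-cli (presumably a different batching path) but not via llama-server.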