Conversation

ericcurtin (Collaborator)

This is a key change, just letting users know.

@ericcurtin (Collaborator, Author)

This completes #15434 by documenting the default in the help info. @JohannesGaessler @slaren PTAL

common/arg.cpp (Outdated)

@@ -2466,7 +2466,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_N_CPU_MOE_DRAFT"));
     add_opt(common_arg(
         {"-ngl", "--gpu-layers", "--n-gpu-layers"}, "N",
-        "number of layers to store in VRAM",
+        string_format("number of layers to store in VRAM (default: %d, 999 = max layers)", params.n_gpu_layers),
JohannesGaessler (Collaborator)

I'm not happy with this wording w.r.t. "999 = max layers". Maybe "max. number of model layers to store in VRAM"?

ericcurtin (Collaborator, Author)

Done

@ericcurtin force-pushed the document-new-ngl-default branch from b9f1fb6 to 08c531b on September 3, 2025 at 15:36
common/arg.cpp (Outdated)

@@ -2466,7 +2466,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_N_CPU_MOE_DRAFT"));
     add_opt(common_arg(
         {"-ngl", "--gpu-layers", "--n-gpu-layers"}, "N",
-        "number of layers to store in VRAM",
+        string_format("number of layers to store in VRAM (default: %d, 999 = max. number of model layers to store in VRAM)", params.n_gpu_layers),
JohannesGaessler (Collaborator)

Suggested change:
-        string_format("number of layers to store in VRAM (default: %d, 999 = max. number of model layers to store in VRAM)", params.n_gpu_layers),
+        string_format("max. number of layers to store in VRAM (default: %d)", params.n_gpu_layers),

This is what I meant. I don't like that it kind of sounds like the value 999 is special; it's simply a high value so that all layers are run on the GPU. But I'm not 100% happy with this wording either, since it can be difficult to understand for people unfamiliar with the software.
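To make the reviewer's point concrete, here is a minimal standalone sketch (illustration only, not llama.cpp's actual offload code): any requested layer count at or above the model's layer count behaves the same as "offload everything", so 999 is just a conveniently large number rather than a special sentinel.

#include <algorithm>
#include <cstdio>

// Illustration only: the number of layers actually offloaded is capped by the model.
static int layers_to_offload(int requested, int model_layers) {
    return std::min(requested, model_layers);
}

int main() {
    const int model_layers = 32;                                // hypothetical model size
    std::printf("%d\n", layers_to_offload(16,  model_layers));  // prints 16
    std::printf("%d\n", layers_to_offload(999, model_layers));  // prints 32: 999 is not special
    return 0;
}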

ericcurtin (Collaborator, Author)

How about?

string_format("max. number of layers to store in VRAM (default: %d, 999 = use max. number of layers available)", params.n_gpu_layers),

We explain the default value elsewhere in the help output.
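
As a rough illustration of what that wording would look like when printed in the help output, here is a small standalone C++ sketch (not llama.cpp code; the flag layout and the -1 default are hypothetical stand-ins for the real help formatter and params.n_gpu_layers):

#include <cstdio>

int main() {
    const int n_gpu_layers = -1; // hypothetical stand-in for params.n_gpu_layers
    // printf-style formatting, analogous to the string_format call proposed above
    std::printf("-ngl, --gpu-layers, --n-gpu-layers N\n");
    std::printf("    max. number of layers to store in VRAM (default: %d, 999 = use max. number of layers available)\n",
                n_gpu_layers);
    return 0;
}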

This is a key change, just letting users know.

Signed-off-by: Eric Curtin <[email protected]>
@ericcurtin force-pushed the document-new-ngl-default branch from 08c531b to 359f802 on September 3, 2025 at 15:52
@JohannesGaessler (Collaborator) left a comment

I still don't have a better idea for how to phrase it but I think that this is an improvement over master as-is.

@ericcurtin merged commit badb80c into master on September 4, 2025 (47 of 48 checks passed)
@ericcurtin deleted the document-new-ngl-default branch on September 4, 2025 at 09:49
@jacekpoplawski (Contributor)

It could be rephrased to say that this option now limits the number of GPU layers.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 4, 2025
…upport

* origin/master: (72 commits)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666)
ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762)
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715)
model-conversion : fix pyright errors (ggml-org#15770)
sampling : optimize dist sampler (ggml-org#15704)
llama : fix incorrect model type for Gemma 270M (ggml-org#15764)
model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765)
CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735)
ggml-cpu : optimize RVV kernels (ggml-org#15720)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
This is a key change, just letting users know.

Signed-off-by: Eric Curtin <[email protected]>