CANN: fix acl_rstd allocation size in ggml_cann_rms_norm #15760
Conversation
@yuchuan-cao, could you review my code?

Opt Test:
Force-pushed from de19f0a to 8896981.
Fixes ggml-org#15330 Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation. Co-authored-by: Yuchuan <[email protected]>
Force-pushed from 8896981 to e1d137a.
}
aclTensor* acl_rstd = get_f32_cache_acl_tensor(
    ctx,
    &ctx.rms_norm_zero_tensor_cache.cache,
    ctx.rms_norm_zero_tensor_cache.size,
    src->ne,
    acl_rstd_ne,
You can just use src->ne as rstd_ne; the acl_rstd_ne array is not necessary.
That approach won’t work. If we simply reuse src->ne, then for a 3D tensor the resulting dimensions would be {ne0, ne1, ne2}, whereas what we actually need are {ne1, ne2, ne3}.
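For reference, a minimal sketch of the shape relationship being described here, assuming ggml's usual 4D ne layout; the initializer is illustrative, not this PR's actual code:

// acl_rstd holds one value per normalized row of src, so it drops ne0
// and keeps the remaining three dimensions.
int64_t acl_rstd_ne[] = { src->ne[1], src->ne[2], src->ne[3] };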
I suppose preparing a src->ne sized buffer would work fine, and preparing an acl_rstd_ne sized buffer would be a more memory-efficient solution.

I suggest you double-check the changes in llama-bench. The confusing thing is that the memory problem in #15330 doesn't occur in test-backend-ops and llama-cli. It only emerges when running llama-bench. I failed to find out why.
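To make the size tradeoff above concrete, a rough element-count comparison under the same 4D assumption (a hypothetical illustration, not repo code):

// src->ne sized buffer: one f32 per element of src.
int64_t full_elems = src->ne[0] * src->ne[1] * src->ne[2] * src->ne[3];
// acl_rstd_ne sized buffer: one f32 per normalized row.
int64_t rstd_elems = src->ne[1] * src->ne[2] * src->ne[3];
// The rstd buffer is smaller by a factor of ne0 (the row length).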
@yuchuan-cao, thanks for the suggestion. I ran the llama-bench test.
Fixes #15330

Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.

Co-authored-by: Yuchuan <[email protected]>
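For illustration, a minimal sketch of what `dims` = 3 implies for the buffer size, assuming f32 storage and ggml's 4D ne layout (names are illustrative, not this PR's exact code):

const int dims = 3; // rstd is a 3-D tensor per the CANN documentation
int64_t rstd_ne[3] = { src->ne[1], src->ne[2], src->ne[3] };
size_t rstd_bytes = sizeof(float);
for (int i = 0; i < dims; ++i) {
    rstd_bytes *= (size_t) rstd_ne[i];
}
// rstd_bytes is the minimum allocation acl_rstd needs; an undersized
// buffer is the kind of problem this PR fixes.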