add hunyuan moe support for 561 #565
Conversation
I'm currently processing an imatrix and noticed that it requires […]. This seems to be working so far, though the perplexity still seems higher than I expected, which could be indicative of a problem:

./build/bin/llama-imatrix \
--verbosity 1 \
--layer-similarity \
-m /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
-fa \
--ctx-size 512 \
-ts 48,48 \
-ngl 18 \
--threads 24
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 709.171 ms
compute_imatrix: computing over 865 chunks with batch_size 512
compute_imatrix: 4.37 seconds per pass - ETA 1 hours 3.07 minutes
[1]12.7104,[2]14.8010,[3]14.3374,[4]30.5778,[5]17.4738,[6]14.5285,[7]20.2402,[8]14.9318,[9]11.7604,
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[10]12.0205,[11]10.2799,[12]12.3863,[13]14.9808,[14]16.1885,[15]16.6677,[16]20.9547,[17]19.1613,[18]17.4531,[19]15.5200,
save_imatrix: stored collected data after 20 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[20]14.7222,[21]13.4574,[22]12.5603,[23]11.8334,[24]11.1943,[25]10.7840,[26]10.5614,[27]10.8168,[28]11.2630,[29]11.9753,
save_imatrix: stored collected data after 30 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[30]12.7904,[31]12.8568,[32]12.7520,[33]13.2066,[34]13.7438,[35]14.3701,[36]15.2825,[37]16.4474,[38]17.2615,[39]17.7246,
save_imatrix: stored collected data after 40 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[40]20.3797,[41]22.3074,[42]22.9196,[43]23.5967,[44]24.9652,[45]26.3450,[46]28.0728,[47]28.1975,[48]27.9526,[49]31.3467,
save_imatrix: stored collected data after 50 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[50]30.1730,[51]31.2195,[52]30.6089,[53]30.0938,[54]29.5127,[55]29.9680,[56]29.2944,[57]28.2416,[58]27.2467,[59]26.2110,
save_imatrix: stored collected data after 60 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[60]25.3394,[61]24.4437,[62]23.7538,[63]25.8637,[64]27.0096,[65]28.0507,[66]27.7521,[67]29.0344,[68]29.8659,[69]30.3886,
save_imatrix: stored collected data after 70 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[70]31.4350,[71]31.8531,[72]31.7906,[73]31.7912,[74]32.9230,[75]34.9214,[76]37.0384,[77]38.7590,[78]38.9847,[79]40.2656,
save_imatrix: stored collected data after 80 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[80]41.5627,[81]41.0075,[82]42.5855,[83]44.5075,[84]43.9110,[85]43.3078,[86]42.7130,[87]41.7924,[88]41.2850,[89]41.5686,
save_imatrix: stored collected data after 90 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[90]40.8182,[91]41.2610,[92]42.4782,[93]44.0758,[94]43.5943,[95]43.7613,[96]43.0079,[97]42.6615,[98]43.6499,[99]43.1762,
save_imatrix: stored collected data after 100 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[100]42.4092,[101]43.1918,[102]44.5605,[103]44.1737,[104]44.2998,[105]45.3024,[106]45.5803,[107]45.3388,[108]45.5154,[109]45.8490,
save_imatrix: stored collected data after 110 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[110]45.6819,[111]46.1607,[112]46.8070,[113]47.5833,[114]48.5492,[115]48.9797,[116]49.6842,[117]49.8659,[118]51.1640,[119]51.3824,
save_imatrix: stored collected data after 120 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[120]52.0141,[121]53.6073,[122]55.3684,[123]56.2596,[124]56.0548,[125]56.1662,[126]56.3532,[127]57.2403,[128]56.6770,[129]58.3851,
save_imatrix: stored collected data after 130 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[130]58.2333,[131]59.2614,[132]60.7497,[133]62.4619,[134]63.7352,[135]64.8522,[136]66.5478,[137]64.9457,[138]63.5455,[139]63.2199,
save_imatrix: stored collected data after 140 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
...

EDIT: I also tried adding this model to the list for […]. FWIW these are similar numbers to what I'm getting with mainline llama-imatrix: […]
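(Aside, for anyone wondering what the imatrix run above actually accumulates: as I understand it, for each weight matrix the tool keeps a running average of the squared activations seen on every input channel over the calibration text, roughly

$$
\mathrm{imatrix}_j \;\approx\; \frac{1}{T}\sum_{t=1}^{T} x_{t,j}^{2},
$$

and the quantizer later uses these values to weight the per-channel error, so channels that see large activations get quantized more carefully. Treat the exact normalization as an assumption on my part; the point is just that it is an activation-magnitude statistic.)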
No-FA and FA giving very different PPL values is not a good sign. A PPL of 60 is not a good sign either, especially for a model of that size.
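(For reference on what these numbers mean: the reported perplexity is just the exponentiated mean negative log-likelihood over the evaluated tokens,

$$
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(t_i \mid t_{<i}\right)\right),
$$

so a PPL around 60 means the model is, on average, about as uncertain as a uniform pick over 60 candidate tokens at each position, which is far worse than one would expect from an 80B-class model.)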
I'm going to leave an endpoint up for a little bit if anyone wants to try the first experimental quant... no promises lol.

Endpoint

WebUI: https://llm.ubergarm.com/ There are 8 concurrent slots, each with a 64k prompt limit.

Test Quant

I just rolled an imatrix.dat and made my first quant for testing.

How I ran it:

model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ4_K.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Hunyuan-A13B-Instruct-IQ4_K \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 524288 \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-ts 48,48 \
-ngl 99 \
--parallel 8 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
src/llama.cpp (outdated):

Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);
If you check your previous PR about GLM4, you will see that you had to remove the Vcur reshaping. It is the same here: remove this line and it is likely that the difference between FA and no FA will go away.
Yup, thanks for the reminder! The two trickiest parts of porting an architecture are remembering to:

- Remove the Vcur reshaping (see the sketch below).
- Swap the argument order: on mainline build_attn() the arguments go Qcur, Kcur, Vcur, but here llm_build_kv() takes Kcur, Vcur, Qcur.
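Here is a minimal standalone sketch of that reshape difference (the head/token counts are illustrative only, not the actual Hunyuan graph-building code):

```cpp
// Sketch only: shows which projections get the 3D head-split reshape.
// Dimensions are made up for illustration; the real values come from the
// model hparams inside the llm_build_* code.
#include "ggml.h"

int main() {
    const int n_embd_head = 128, n_head = 32, n_head_kv = 8, n_tokens = 512;

    struct ggml_init_params params = {
        /*.mem_size   =*/ 64u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // only building tensor metadata here
    };
    struct ggml_context * ctx0 = ggml_init(params);

    // projections as they come out of the mat-muls: [n_embd, n_tokens]
    struct ggml_tensor * Qcur = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd_head*n_head,    n_tokens);
    struct ggml_tensor * Kcur = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd_head*n_head_kv, n_tokens);
    struct ggml_tensor * Vcur = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd_head*n_head_kv, n_tokens);

    // split heads for Q and K only
    Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
    // no reshape of Vcur -- it stays 2D, same as the GLM4 fix
    (void) Qcur; (void) Kcur; (void) Vcur;

    ggml_free(ctx0);
    return 0;
}
```

The only point is the asymmetry between Q/K and V right before the llm_build_kv() call.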
Just re-downloaded the new .safetensors, converted, and built a fresh quant to test:

FA=1
Final estimate: PPL = 522.7473 +/- 5.68072

FA=0
Final estimate: PPL = 527.6625 +/- 5.73144

So it looks "good" now haha... I didn't wait to get the bf16's PPL, but this is in the same ballpark as what mainline is seeing (around ~500).
Of course I couldn't help myself and had to try out the new IQ3_KS quant as well lol...
So far so good!
llm_load_print_meta: model type = 80B.A13B
llm_load_print_meta: model ftype = IQ3_KS - 3.1875 bpw
llm_load_print_meta: model params = 80.393 B
llm_load_print_meta: model size = 34.088 GiB (3.642 BPW)
llm_load_print_meta: general.name = Hunyuan A13B Instruct
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k
# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k
# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks # let's live dangerously
# Token Embedding
token_embd\.weight=iq6_k # splurged here a bit as this model's tokenization seems weird
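As a quick sanity check, the reported average bits-per-weight follows directly from the printed file size and parameter count:

$$
\frac{34.088 \times 2^{30} \times 8\ \text{bits}}{80.393 \times 10^{9}\ \text{weights}} \approx 3.64\ \text{bits per weight},
$$

which matches the 3.642 BPW in the log; the average sits above the nominal 3.1875 bpw of IQ3_KS because the attention, shared-expert, and embedding tensors in the recipe use larger types.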
Tested on your API, it works for Chinese Q&A.
Ahh very good, thank you. Tonight I was running my updated experimental IQ3_KS, which I went ahead and released prematurely because, oh well, it seems okay lol... Thanks for testing! It can fit 256k context in under 24GB VRAM when not offloading additional exps and with […].
Running on WSL I got an error: Floating point exception (core dumped), in the initial process of ik_llama.cpp.

OS: Win 11 + WSL
It's because I'm a madman and released a quant depending on two unmerged PRs. Check here for instructions on how to get the IQ3_KS PR: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF#note-building-experimental-prs Also @kiron111, look at the examples on the model card; you will need to use […]. This model is great for low-VRAM machines and can probably run in 6GB VRAM with some usable context.
Thank you!
The PPL of 500+ is not very promising. I suspect this is because of the unimplemented technique to reduce the importance of recently used experts, discussed in the mainline PR, which completely changes inference compared to how the model was trained. Hence I'm still wondering whether to merge. They have merged it as-is in mainline, but […]
Looking more closely, yes, I see that the official PyTorch reference MoE routing "capacity" mechanism does not seem to be implemented in the build_moe_ffn() code. The mainline PR […]. I'll try quanting from the Pretrain version just to see how it performs, given that the bf16 scores much lower PPL, oddly enough.

EDIT: […]
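For anyone following along, here is a rough, generic sketch of what a per-batch "capacity" constraint on top-k expert routing means. This is not the Hunyuan reference implementation and not build_moe_ffn(); the expert count, top-k, and capacity formula are made-up illustrative values. The point is only that assignments beyond an expert's capacity get dropped, which changes which experts a token actually uses compared to plain top-k:

```cpp
// Generic illustration of capacity-limited top-k routing (NOT Hunyuan's
// reference code, NOT build_moe_ffn): tokens pick their best-scoring
// experts, but each expert only accepts up to `capacity` tokens per batch.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n_expert = 8, n_top_k = 2, n_tokens = 16;                  // illustrative only
    const int capacity = (n_tokens * n_top_k + n_expert - 1) / n_expert; // ceil

    std::vector<int>   used(n_expert, 0);   // tokens accepted by each expert so far
    std::vector<float> score(n_expert);

    for (int t = 0; t < n_tokens; ++t) {
        // stand-in router scores; in a real model these come from the gate mat-mul
        for (int e = 0; e < n_expert; ++e) score[e] = (float) ((t*7 + e*3) % 11);

        // rank experts by score, best first
        std::vector<int> order(n_expert);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return score[a] > score[b]; });

        // take the top-k experts that still have room; full experts are skipped,
        // so the effective routing differs from unconstrained top-k
        int picked = 0;
        for (int i = 0; i < n_expert && picked < n_top_k; ++i) {
            const int e = order[i];
            if (used[e] < capacity) {
                used[e]++; picked++;
                printf("token %2d -> expert %d (score %.0f)\n", t, e, score[e]);
            }
        }
    }
    return 0;
}
```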
OK, let's merge this.
Based this PR on mainline ggml-org/llama.cpp#14425. Didn't merge any of the Python stuff (used the mainline convert script). Tested with bf16 on hybrid CUDA+CPU.
model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Hunyuan-A13B-Instruct-bf16 \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 8192 \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-ts 48,48 \
-ngl 16 \
--threads 24 \
--host 127.0.0.1 \
--port 8080
Would be great if anyone else could test, e.g. @Downtown-Case, as per #561.
I haven't yet made an imatrix nor tried to quantize further.

Might be able to use one of the following if it was converted recently enough: […]

The behavior seems a bit odd: it will answer in Chinese if I don't use some kind of system prompt or explicitly say to speak in English. Mainline seems to use some kind of --jinja thing which isn't supported here, pretty sure. So YMMV.