Releases: intel/xFasterTransformer
v1.7.0 - Continuous batching feature supported.
Functionality
- Refactor framework to support continuous batching feature. `vllm-xft`, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM features.
- Remove FP32 data type option of KV Cache.
- Add `get_env()` Python API to get the recommended LD_PRELOAD set (see the sketch after this list).
- Add GPU build option for Intel Arc GPU series.
- Exposed the interface of the LLaMA model, including Attention and decoder.
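A minimal usage sketch of the new `get_env()` helper. The exact return value (assumed here to be an LD_PRELOAD-ready string) is not spelled out in these notes, so treat this as illustrative rather than a definitive API reference.

```python
# Hypothetical usage sketch of get_env(); the return type (a ready-to-use
# LD_PRELOAD string is assumed here) may differ from the actual API.
import xfastertransformer

# Print the recommended LD_PRELOAD set so it can be exported before launching
# an inference job, e.g.:
#   LD_PRELOAD=$(python -c 'import xfastertransformer; print(xfastertransformer.get_env())') python demo.py
print(xfastertransformer.get_env())
```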
Performance
- Update xDNN to release v1.5.1.
- Baichuan series models support a full FP16 pipeline to improve performance.
- More FP16 data type kernels added, including MHA, MLP, YARN rotary_embedding, rmsnorm and rope.
- Kernel implementation of crossAttnByHead.
Dependency
- Bump `torch` to `2.3.0`.
BUG fix
- Fixed the segmentation fault error when running with more than 4 ranks.
- Fixed the bugs of core dump and hang when running across nodes.
What's Changed
- [Fix] add utf-8 encoding. by @marvin-Yu in #354
- [Benchmark] Calculate throughput using avg latency. by @Duyi-Wang in #360
- [GPU] Add GPU build option. by @changqi1 in #359
- Fix Qwen prompt.json by @JunxiChhen in #368
- [Model] Fix ICX build issue. by @changqi1 in #370
- [CMake] Remove evaluation under XFT_BUILD_TESTS option. by @Duyi-Wang in #374
- [Kernel][UT] Kernel impl. of crossAttnByHead and unit test for cross attention. by @pujiang2018 in #348
- [API] Add LLaMA attention API. by @changqi1 in #378
- [Finetune] Scripts for Llama2-7b lora finetune example using stock pytorch by @ustcuna in #327
- [Demo] Add abbreviation for output length. by @Duyi-Wang in #385
- [API] Add LLaMA decoder API. by @changqi1 in #386
- [API] Optimize API Impl. by @changqi1 in #396
- [Framework] Continuous Batching Support by @pujiang2018 in #357
- [KVCache] Remove FP32 data type. by @Duyi-Wang in #399
- [Interface] Change return shape of forward_cb. by @Duyi-Wang in #400
- [Example] Add demo of offline continuous batching by @pujiang2018 in #401
- [Layers] Add alibiSlopes Attn && Flash Attn for CB. by @abenmao in #402
- [Interface] Support List[int] and List[List[int]] for set_input_sb. by @Duyi-Wang in #404
- [Bug] fix incorrect input offset computing by @pujiang2018 in #405
- [Example] Fix incorrect tensor dimension with latest interface by @pujiang2018 in #406
- [Models/Layers/Kernels] Add Baichuan1/2 full-link bf16 support & Fix next-tok gen bug by @abenmao in #407
- [xDNN] Release v1.5.0. by @changqi1 in #410
- [Kernel] Add FP16 rmsnorm and rope kernels. by @changqi1 in #408
- [Kenrel] Add FP16 LLaMA YARN rotary_embedding. by @changqi1 in #412
- [Benchmark] Add platform options. Support real model. by @JunxiChhen in #409
- [Dependency] Update torch to 2.3.0. by @Duyi-Wang in #416
- [COMM] Fix bugs of core dump && hang when running cross nodes by @abenmao in #423
- [xDNN] Release v1.5.1. by @changqi1 in #422
- [Kernel] Add FP16 MHA and MLP kernels. by @changqi1 in #415
- [Python] Add `get_env()` to get LD_PRELOAD set. by @Duyi-Wang in #427
- Add --padding and fix bug by @yangkunx in #418
- [Layers] Fixed the seg fault error when running with more than 4 ranks by @abenmao in #424
- [Kernel] Less compute for Self-Attention (Q * K) by @pujiang2018 in #420
- [Dependency] Update libiomp5.so to `5.0.20230815` contained in mkl. by @Duyi-Wang in #430
- [Distribute] Add distribute support for continuous batching api. by @Duyi-Wang in #421
- [Layers] Fixed error in yarn by @abenmao in #429
- [README] Update readme. by @Duyi-Wang in #431
- [Dependency] Fix wrong so path returned in `get_env()`. by @Duyi-Wang in #432
- [Version] v1.7.0. by @Duyi-Wang in #433
New Contributors
Full Changelog: v1.6.0...v1.7.0
v1.6.0 - Llama3 and Qwen2 series models supported.
Functionality
- Support Llama3 and Qwen2 series models.
- Add INT8 KV cache data type, specified via the `kv_cache_dtype` parameter with options `int8`, `fp16` (default) and `fp32` (see the sketch after this list).
- More models enable the full BF16 pipeline, including ChatGLM2/3 and YaRN-Llama.
- Add invokeMLPLLaMA FP16 API.
- Support logits output using the `forward()` API.
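A hedged sketch of selecting the new KV cache data type. The model path is a placeholder and routing `kv_cache_dtype` through `AutoModel.from_pretrained()` is an assumption based on the wording above.

```python
# Sketch only: the model path is a placeholder; passing kv_cache_dtype through
# AutoModel.from_pretrained() is an assumption based on these release notes.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/llama-3-8b-xft",      # hypothetical path to converted xFT weights
    dtype="bf16",                # compute data type
    kv_cache_dtype="int8",       # new option: "int8", "fp16" (default) or "fp32"
)
```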
Dependency
- Bump `transformers` to `4.40.0` to support Llama3 models.
Performance
- Update xDNN to release v1.4.6.
BUG fix
- Fix numeric overflow when calculating softmax in sampling.
- Fix assert bug when concatenating gate & up.
What's Changed
- [Model] Expose KV cache data type in Llama model. by @pujiang2018 in #313
- [API] Format rotary_embedding api. by @changqi1 in #303
- [Kernel] Add kernel support for INT8 KV cache. by @pujiang2018 in #314
- [Convert] Fix Qwen convert issue. by @marvin-Yu in #315
- [API] Add invokeMLPLLaMA FP16 API. by @changqi1 in #302
- [Build] Fix build issue. by @changqi1 in #316
- Chatglm2/3 bf16 pipeline support by @a3213105 in #301
- [README] Add README_CN.md. by @Duyi-Wang in #317
- [Kernel] Bug fix for small_gemm_transb by @pujiang2018 in #318
- [Eval] Get logits output. by @marvin-Yu in #319
- [CMake] Add oneccl build depends for comm_helper. by @Duyi-Wang in #322
- [Layers] fix assert bug when concat gate&up by @abenmao in #323
- [Sample] Fix numeric overflow when calculate softmax. by @Duyi-Wang in #326
- [Models] Use factory class to create decoder. by @Duyi-Wang in #321
- [RAEDME] Update readme for the dependent lib. by @xwang98 in #331
- [KVCache] INT8 KV cache implementation and related changes by @pujiang2018 in #320
- [Model] Add Qwen2 model. by @marvin-Yu in #330
- [KVCache] Add inferface and register for kvcache. by @Duyi-Wang in #336
- [Demo] Add kvcache type option in web demo. by @Duyi-Wang in #338
- [Benchmark] Add KVCache data type option. by @Duyi-Wang in #337
- [model] Add llama3 model. by @marvin-Yu in #340
- [Kernel] Add 'acc' param in small_gemm, add lacked and remove unused small_gemm kernels. by @pujiang2018 in #346
- [xDNN] Release v1.4.6. by @changqi1 in #342
- [Evaluation] fix the model register bug in evaluation by @abenmao in #347
- [Models] YaRN-Llama full-link bf16 support by @abenmao in #344
- [UT] Remove beam search test temporarily. by @Duyi-Wang in #349
- [Version] v1.6.0. by @Duyi-Wang in #352
New Contributors
Full Changelog: v1.5.0...v1.6.0
v1.5.0 - Gemma series models supported.
Functionality
- Support Gemma series models, including Gemma and CodeGemma, and the DeepSeek model.
- Llama Converter supports converting quantized Hugging Face models into xFT-format INT8/INT4 model files via the `from_quantized_model='gptq'` parameter (see the sketch after this list).
- Support loading INT4 data weights directly from local files.
- Optimize memory usage during QWen model conversion, particularly for QWen 72B.
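A minimal sketch of converting a GPTQ-quantized checkpoint with the Llama converter from the Python wheel. Paths are placeholders and the positional arguments of `convert()` are assumptions; only `from_quantized_model='gptq'` comes from the notes above.

```python
# Sketch, assuming the LlamaConvert class exposed by the python wheel; paths
# are placeholders and the argument order of convert() may differ.
import xfastertransformer

xfastertransformer.LlamaConvert().convert(
    "/data/llama-2-7b-gptq",        # hypothetical AutoGPTQ-quantized HF checkpoint
    "/data/llama-2-7b-xft-int4",    # output directory for xFT INT8/INT4 weights
    from_quantized_model="gptq",    # parameter documented in this release
)
```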
Dependency
- Bump `transformers` to `4.38.1` to support Gemma models.
- Add `protobuf` to support new behavior in `tokenizer`.
Performance
- Update xDNN to release v1.4.5.
- Add GPU kernel library gpuDNN v0.1 to support Intel Arc GPU series.
- Optimize ROPE performance by reducing repeated sin and cos embedding table data.
- Accelerate KVCache copy by increasing parallelism in self attention.
- Accelerate addreduce operation in long sequence case by transposing KVCache and tuned comm.
BUG fix
- Fix an incorrect computation which should be in float, but was in integer.
- Fix disordered timeline.
- Fix runtime issue of Qwen when seq_length is bigger than 32768.
What's Changed
- [Kernel] Fix the incorrect computing which should be in float, but was in integer by @pujiang2018 in #267
- [Layer] Reduce repeated sin and cos embedding table data to optimize ROPE perf. by @changqi1 in #266
- [Kernel] increase parallelism for KV cache copy in self attention by @pujiang2018 in #268
- [Include] Fix include not work. by @Duyi-Wang in #271
- Issue qwen72b seq length by @a3213105 in #273
- [Common] Unify memory allocation into xft::alloc by @pujiang2018 in #272
- [Timeline] Fix disordered timeline. by @changqi1 in #277
- [model] Add deepseek model. by @marvin-Yu in #274
- [Bug] Fix incorrect context parameter order. by @changqi1 in #280
- [CI] Check for UT status. by @marvin-Yu in #278
- [CMake] Check existence of MKL & oneDNN directory before installation. by @Duyi-Wang in #283
- Add KVCache trans for long sequence && tuned comm for faster Addreduce by @abenmao in #279
- [Dependency] Add protobuf in requirements.txt by @Duyi-Wang in #284
- [xDNN] Release v1.4.5. by @changqi1 in #285
- [CI] Add rls test case. by @marvin-Yu in #286
- [Bug] fix baichuan model test issue. by @marvin-Yu in #287
- [Fix] Fix baichuan2-13 without rope. by @marvin-Yu in #289
- [Tools] Add convert tool for Llama models quantized by AutoGPTQ by @xiangzez in #276
- [Common] Support loading int4 weights by @xiangzez in #275
- [KVCache] KV Cache refactor and related unit test case fix by @pujiang2018 in #290
- [Model] Update isMaster func. by @changqi1 in #292
- [Bug] Fix oneDNN GPU build issue. by @changqi1 in #293
- [UT] add unit test for selfAttention, and a small fix by @pujiang2018 in #294
- [gpuDNN] Add gpuDNN v0.1.0 library files. by @feng-intel in #291
- [UT] MLP unit test case fix by @abenmao in #296
- [Fix] Reduce convert memory usage. by @marvin-Yu in #297
- [ENV] Use Meyers' Singleton Env object. by @Duyi-Wang in #295
- [fix] fix compile issue. by @marvin-Yu in #299
- [Example] Add gemma model config and web demo. by @marvin-Yu in #304
- [Model] Add gemma model support. by @marvin-Yu in #259
- [example] add gemma model support with example. by @marvin-Yu in #307
- Bump transformers from 4.36.0 to 4.38.0 in /examples/web_demo by @dependabot in #308
- Fix timeline compile issue by @xiangzez in #309
- [Build] Fix build issues. by @changqi1 in #310
- [Version] v1.5.0. by @Duyi-Wang in #311
New Contributors
- @feng-intel made their first contribution in #291
Full Changelog: v1.4.0...v1.5.0
v1.4.0 - Full BF16 support in Llama for better performance and serving framework support.
Functionality
- Introduce pure BF16 support to Llama series models; the fully BF16 data type can now be used to utilize AMX more effectively when deploying Llama models.
- Add MLServer serving framework support and demo in the `serving` directory.
- GCC for compiling release binary files has been updated from GCC 8.5 to GCC 12.
- Introduce pipeline parallel feature for distributed deployment. Enable it with `cmake .. -DWITH_PIPELINE_PARALLEL=ON` at compile time and use the `XFT_PIPELINE_STAGE` macro to define the number of pipeline parallel stages.
- Deprecate convert tool scripts in the `tools` directory; using `Convert` in the xfastertransformer Python wheel is recommended instead (see the sketch after this list).
- Support loading INT8 data weights directly from local files.
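A minimal sketch of the recommended conversion path through the Python wheel, replacing the deprecated scripts in `tools`. Paths are placeholders and the exact `convert()` signature is an assumption.

```python
# Sketch of converting a HuggingFace Llama checkpoint with the Convert module
# from the python wheel; both paths are placeholders.
import xfastertransformer

xfastertransformer.LlamaConvert().convert(
    "/data/llama-2-7b-hf",     # source HuggingFace model directory (placeholder)
    "/data/llama-2-7b-xft",    # destination for converted xFT weights (placeholder)
)
```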
Performance
- Update xDNN to release v1.4.4.
- Accelerate model weight loading by optimizing the cast operation after loading, gaining up to 50% speedup.
- Optimize BF16 performance using the AMX instruction when batch size <= 8, and add `XFT_USE_AMX_M` to set the threshold of M for using AMX instead of AVX512; the default is `1` (see the sketch after this list).
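A small sketch of overriding the AMX threshold described above. Setting the variable from Python before the model is loaded is an assumption; exporting it in the shell works just as well, and the model path is a placeholder.

```python
# Sketch: raise the M threshold at which AMX is used instead of AVX512.
# Setting it via os.environ before loading the model is an assumption; the
# variable can equally be exported in the shell. Default is 1 per the notes.
import os

os.environ["XFT_USE_AMX_M"] = "8"

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")  # placeholder path
```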
Demo & Benchmark
- Update dependency `transformers` requirement from `4.30.0` to `4.36.0` for high-risk CVE vulnerabilities.
- Add distributed inference benchmark script which supports deployment across platforms.
- Add single-node platform support in benchmark script.
- Add Yi model web demo.
- Enhance the command-line chat mode in the PyTorch demo.py; use `--chat true` to enable it.
BUG fix
- Fix calculation issue in Qwen models and enhance LogN support for long token sequences.
- Fix unsynchronized results in multi-rank mode when `do_sample` is enabled.
- Fix Baichuan models calculation and convert issues.
- Fix repetition penalties not taking effect on other batches.
What's Changed
- [Demo] Update web demo to adapt gradio 4.11.0. by @Duyi-Wang in #201
- Bump gradio from 3.40.1 to 4.11.0 in /examples/web_demo by @dependabot in #150
- [demo] Add Yi model demo. by @marvin-Yu in #200
- [Dependency] transformers version warning when error occurs. by @Duyi-Wang in #202
- [Tools] Deprecate convert tools in tools dir. by @Duyi-Wang in #203
- [benchmark] Add one node benchmark. by @marvin-Yu in #205
- Bump transformers from 4.30.0 to 4.36.0 by @dependabot in #145
- Bump transformers from 4.30.0 to 4.36.0 in /examples/web_demo by @dependabot in #144
- [CMake] Check if the compiler really supports avx512bf16 with try_compile by @pujiang2018 in #206
- [Layer] Fine grained data type definition for Attention and MLP by @pujiang2018 in #194
- Add recommend GCC version by @a3213105 in #207
- [TP] Make split dimension align with oneDNN packing by @pujiang2018 in #208
- Support loading int8 weights by @xiangzez in #157
- [benchmark] Add distributed benchmark. by @marvin-Yu in #211
- [ci] Fix python path issue. by @marvin-Yu in #214
- [Fix] Fix repetition penalties not taking effect on other batches. by @Duyi-Wang in #212
- [xDNN] Release v1.4.3. by @changqi1 in #213
- [ci] Add workflow permission. by @marvin-Yu in #218
- [Layer] Enable pipeline parallel feature. by @changqi1 in #221
- [Dockerfile] Remove dockerfile. by @Duyi-Wang in #219
- [CI] Align using benchmark tests. by @marvin-Yu in #216
- [xDNN] Release v1.4.4. by @changqi1 in #223
- [Layer] Support pure full-link BF16 LLaMa model. by @pujiang2018 in #222
- [Layers] Qwen LogN for query by @a3213105 in #215
- [Layer] Convert static MMHelper class to instance Class in DecoderContext. by @changqi1 in #225
- [models][layers/tools] Refine and bugfix for baichuan models by @abenmao in #226
- [Serving] Add MLServer serving support. by @Duyi-Wang in #217
- [Dependencies] Remove tokenizers requirement. by @Duyi-Wang in #227
- [kernel] Add ICX compiler. by @changqi1 in #228
- [Env] Add XFT_ENGINE env variable. by @changqi1 in #231
- [CMake] Open the pip-install information for MKL. by @marvin-Yu in #234
- [Fix] Add parameter check for logN and NTK rotary embedding of QWEN by @a3213105 in #232
- [CMake] Remvoe force reinstall for mkl dependencies. by @Duyi-Wang in #237
- [Example] Add seq_length in qwen fake config.ini by @Duyi-Wang in #238
- [Tools] Accelerate model loading. by @marvin-Yu in #224
- [Fix] Fix the wrong output of QWEN-14B. by @marvin-Yu in #240
- fix issue #220 by @a3213105 in #242
- Bump gradio from 4.11.0 to 4.19.2 in /examples/web_demo by @dependabot in #241
- [Example] Add llama2 chat support in Cli demo. by @Duyi-Wang in #243
- [Dependency] Update web demo requirement. by @Duyi-Wang in #246
- [Docs] Initial documents. by @Duyi-Wang in #248
- Fix Opt issue by @xiangzez in #251
- [Serving] Fix fail to set pad_token_id when it's not None in single mode. by @Duyi-Wang in #254
- [layers] Add bf16-type input/output support for flash attention by @abenmao in #252
- [Kernel] Set USE_AMX_M to 1. by @Duyi-Wang in #245
- [Benchmark] Fix typo in benchmark script. by @Duyi-Wang in #261
- [Attention Kernel/Layer] group attention support in full-link BF16 path; attention layer refactor by @pujiang2018 in #258
- [Search] Sync smaple result in multi-rank. by @Duyi-Wang in #260
- [Benchmark] Update model cfg for transformers>4.36. by @Duyi-Wang in #257
- [Layer] Use flash attention when larger than threshold ('>=' to '>') by @pujiang2018 in #265
- [Benchmark] Modify CPU affinity logic, add CI prompt output. by @marvin-Yu in #263
- [Version] v1.4.0. by @Duyi-Wang in #262
New Contributors
- @dependabot made their first contribution in #150
Full Changelog: v1.3.1...v1.4.0
v1.3.1
BUG fix
- Fix issue that the oneCCL environment was still needed when running in single-rank mode.
What's Changed
- [demo] Add qwen demo deps. by @marvin-Yu in #193
- Add base_initial to store the original base passed from config by @a3213105 in #196
- [Comm] Check mpirun and env before load helper. by @Duyi-Wang in #197
- [Version] v1.3.1. by @Duyi-Wang in #198
Full Changelog: v1.3.0...v1.3.1
v1.3.0 - Qwen model support enhancement and added support for the SecLLM (YaRN-Llama) model.
Models
- Introduce SecLLM (YaRN-Llama) model support.
- Integrate the Qwen web demo, enhance Qwen model support, and fix known issues in the Qwen convert tool.
Functionality
- Introduce new generation configuration options, `repetition_penalty` and `stop_words_ids` (see the sketch after this list).
- Rotary embedding supports the BF16 data type now.
- Introduce attention interfaces similar to paged attention.
- Add a whitelist to gather timeline events based on filtered events.
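A hedged sketch of the new generation options. Whether they are passed as `generate()` keyword arguments (as assumed here) or through a separate config object is not specified in these notes; paths and token ids are placeholders.

```python
# Sketch only: passing the new options as generate() kwargs is an assumption;
# model path, tokenizer path and stop token ids are placeholders.
import xfastertransformer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data/llama-2-7b-hf")                            # placeholder path
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")  # placeholder path

input_ids = tokenizer("Hello, xFasterTransformer!", return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_length=128,
    repetition_penalty=1.1,     # new in v1.3.0: penalize repeated tokens
    stop_words_ids=[[13]],      # new in v1.3.0: stop on these token-id sequences (placeholder ids)
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```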
BUG fix
- Fix issue that `libxft_comm_helper.so` can't be found in multi-rank mode.
- Fix assert error in MLP when the CAT_MLP optimization is enabled.
- Fix a W8A8 crash caused by an insufficient buffer size.
- Correct GCC version for the AVX512_BF16 instruction set.
- Fix int32 overflow issue for larger sizes.
What's Changed
- [Fix] fix assert error in MLP when enable CAT_MLP opt. by @abenmao in #151
- clang-format: Remove duplicated IndentPPDirectives entry by @huaqiangwang in #159
- [Demo] fix mpi stick the stdout/err in web_demo. by @marvin-Yu in #162
- [Fix] comm_helper can't found issue. by @Duyi-Wang in #161
- [Demo] use average next-token latency as the info to user. by @huaqiangwang in #160
- [Search] Add repetition penalty. by @Duyi-Wang in #163
- Add attention interface (like page attention) by @pujiang2018 in #143
- [Fix] index ID lowerbound check in repetition penalty. by @Duyi-Wang in #167
- Add a white list to collect timeline events on a filtered events by @huaqiangwang in #156
- [Fix] Fix build issue for the TIMELINE filter feature by @huaqiangwang in #169
- [ChatGLM2] Remove unused code. by @a3213105 in #168
- [Benchmark] fix Benchmark performance issue. by @marvin-Yu in #170
- [Framework] Attention/LayerNorm/RmsNorm refactor/enhance to better support BF16 inference. by @pujiang2018 in #171
- [kernel] Fix w8a8 crash issue due to buffer size not big enough by @xiangzez in #158
- [Benchmark] Avoid float in core pre numa calculation. by @Duyi-Wang in #164
- [Model][SecLLM] Add SecLLM(YaRN-Llama) model support by @abenmao in #172
- [LOG] Default disable fake loading log print. by @Duyi-Wang in #173
- [Layer] BF16 support for rotary embedding by @pujiang2018 in #176
- [examples] add qwen & chatglm3 model config. by @marvin-Yu in #177
- [convert] Fix qwen convert with no eos id. by @marvin-Yu in #181
- [demo] Fix chatGLM3 webdemo. by @marvin-Yu in #184
- [Generation] Add stop_words_ids generation config. by @Duyi-Wang in #183
- [Page attention]Prefill add kv cache by @aurora327 in #178
- [common/utils] Fix bug of int32 overflow for larger size by @abenmao in #187
- [Generate] Sync stop words ids in multi-rank mode. by @Duyi-Wang in #190
- [demo] Add qwen demo. by @marvin-Yu in #180
- [Kernel] Add Qwen rotary_embedding ntk support. by @changqi1 in #189
- [Version] v1.3.0. by @Duyi-Wang in #191
New Contributors
- @huaqiangwang made their first contribution in #159
Full Changelog: v1.2.0...v1.3.0
v1.2.0 - Qwen models and more data types supported.
Models
- Introduced Qwen models support and added the convert tool for Qwen models.
- ChatGLM3 model is verified and API supported.
Performance Optimizations
- Update xDNN to version 1.4.2 to improve performance and support more data types.
- Accelerate first token's generation with BF16-gemm Multi-Head Attention.
Functionality
- Introduce support for more data types, including `W8A8`, `INT4`, and `NF4`. Hybrid combinations of these new data types are supported.
- Add accuracy evaluation script to assess the impact of different precisions on the text generation performance of the model.
- Introduce the `XFT_VERBOSE` macro to help profile the model performance of each GEMM. Set it to `1` to enable information output; the default is `0` (see the sketch after this list).
- Decouple oneCCL and MPI dependencies into a communication helper library. The oneCCL environment is no longer needed when running in single-rank mode.
v1.1.0 - Baichuan models supported.
Models
- Introduced Baichuan models support and added the convert tool for Baichuan models.
Performance Optimizations
- Update xDNN to version 1.2.1 to improve performance of BF16 data type with AMX instruction on 4th generation Intel Xeon Scalable processors.
- Improved performance of BF16 data type inference by adding matMul bf16bf16bf16 primitives and optimizing kernel selection strategy.
- Improved performance of the model with unbalanced split allocation.
Functionality
- Introduced prefix sharing feature.
- Add sampling strategy for token search, supporting temperature, top-k, and top-p parameters.
- Introduce convert module to xfastertransformer python API.
- Introduced grouped-query attention support for Llama2.
- Auto-detect oneCCL environment and enter single-rank mode if oneCCL does not exist.
- Auto-detect oneCCL environment in compilation. If not detected, oneCCL will be built from source.
- Add C++ exit function for multi-rank model.
- Remove mklml 3rd party dependency.
- Export normalization and position embedding C++ API, including alibi embedding and rotary embedding.
- Introduced the `XFT_DEBUG_DIR` environment variable to specify the debug file directory.
BUG fix
- Fix runtime issue of oneCCL shared memory model.
- Fix path concat issue in convert tools.
source-publish
Sources used in xFasterTransformer docker release image with a license that requires publication: GPL, LGPL, MPL
Each of the tar files is a component in xFasterTransformer that has a license
that requires the publication of the sources. This includes GPL, LGPL, and MPL. We
are publishing the original sources including any patches. Each component has
its own license, so we do not provide a license for this release.
If you need the sources for a component that is not included here, please
contact [email protected]
v1.0.0
This is the 1st official release of xFasterTransformer.🎇🎇🎇
Support models
- ChatGLM-6B
- ChatGLM2-6B
- Llama 1: 7B, 33B, and 65B
- Llama 2: 7B, 13B, and 70B
- OPT models larger than 1.3B
Features
- Support Python and C++ API to integrate xFasterTransformer into the user's own solutions. Example codes are provided to demonstrate the usage.
- Support hybrid data types such as BF16+FP16 and BF16+INT8 to accelerate the generation of the 1st token, in addition to supporting single data types like FP16, BF16, and INT8.
- Support multiple instances to accelerate model inference, both locally and through the network.
- Support Intel AMX instruction on 4th generation Intel Xeon Scalable processors.
- Support 4th generation Intel Xeon Scalable processors with HBM, which have higher memory bandwidth and show much better performance on LLMs.
- Provide web demo scripts for users to show the performance of LLM models optimized by xFasterTransformer.
- Support multiple distribution methods, both PyPI and docker images.