[Draft] Enable CUTLASS without host compiler #1967
This is a draft PR to enable CUTLASS in torch-xpu-ops so that we can test CUTLASS kernels' accuracy and performance in PyTorch once the SDPA/GEMM kernels are ready.
Since there is no settled plan yet for how to import the cutlass-sycl repo, I download it in CMake for debugging convenience.

I put all the PyTorch/ATen wrapper functions in `ATen/native/cutlass/*.cpp`. They extract the problem shape and device pointers from `at::Tensor`, and they are compiled and linked into `libtorch_xpu_ops.a` with plain gcc. They call the kernel launch functions from `ATen/native/cutlass/sycl/*.cpp` (a sketch of such a wrapper is shown below).
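A minimal sketch of what one of these wrappers might look like, assuming a hypothetical SDPA-backward entry point (the file layout and the `cutlass_sdpa_backward` / `sdpa_backward_launch` names are placeholders, not the actual symbols in this PR):

```cpp
// ATen/native/cutlass/Attention.cpp (illustrative only; built with plain gcc)
#include <ATen/core/Tensor.h>
#include <cstdint>

namespace at::native::cutlass_sycl {
// Declaration of the kernel launch function that lives in
// ATen/native/cutlass/sycl/*.cpp and is compiled separately with icpx.
// Only plain C++ types cross this boundary, so this TU needs no SYCL headers.
void sdpa_backward_launch(
    const void* query, const void* key, const void* value,
    const void* grad_out, void* grad_query, void* grad_key, void* grad_value,
    int64_t batch, int64_t num_heads, int64_t seq_len, int64_t head_dim);
} // namespace at::native::cutlass_sycl

namespace at::native {

// Wrapper: extracts the problem shape and device pointers from at::Tensor
// and forwards them to the CUTLASS SYCL kernel launcher.
void cutlass_sdpa_backward(
    const at::Tensor& query, const at::Tensor& key, const at::Tensor& value,
    const at::Tensor& grad_out, at::Tensor& grad_query, at::Tensor& grad_key,
    at::Tensor& grad_value) {
  const int64_t batch = query.size(0);
  const int64_t num_heads = query.size(1);
  const int64_t seq_len = query.size(2);
  const int64_t head_dim = query.size(3);

  cutlass_sycl::sdpa_backward_launch(
      query.const_data_ptr(), key.const_data_ptr(), value.const_data_ptr(),
      grad_out.const_data_ptr(), grad_query.mutable_data_ptr(),
      grad_key.mutable_data_ptr(), grad_value.mutable_data_ptr(),
      batch, num_heads, seq_len, head_dim);
}

} // namespace at::native
```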
I put all the CUTLASS SYCL kernel functions in `ATen/native/cutlass/sycl/*.cpp`. Since CUTLASS and syclcompat don't support `-fsycl-host-compiler=g++`, I compile the `.cpp` files of the CUTLASS kernels into a separate `libcutlass_kernels.so` library with plain icpx and then link it into `libtorch_xpu_ops.a` with the gcc linker; a skeleton of that icpx-compiled side is shown below.
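For illustration, the icpx-compiled side could look roughly like the skeleton below. Only plain C++ types cross the library boundary, so the gcc-built wrappers can link against it; the actual CUTLASS kernel invocation is elided because it depends on the cutlass-sycl API this PR pulls in, and the file/function names are hypothetical:

```cpp
// ATen/native/cutlass/sycl/Attention.cpp (illustrative skeleton; built with icpx,
// without -fsycl-host-compiler=g++, and shipped in libcutlass_kernels.so)
#include <sycl/sycl.hpp>
#include <cstdint>

namespace at::native::cutlass_sycl {

void sdpa_backward_launch(
    const void* query, const void* key, const void* value,
    const void* grad_out, void* grad_query, void* grad_key, void* grad_value,
    int64_t batch, int64_t num_heads, int64_t seq_len, int64_t head_dim) {
  // A real implementation would take the current XPU queue from PyTorch and
  // build a CUTLASS SYCL problem description from (batch, num_heads, seq_len,
  // head_dim); that part is elided here because it depends on cutlass-sycl.
  sycl::queue q{sycl::gpu_selector_v};

  // ... configure and run the CUTLASS SDPA backward kernel on `q` ...
  (void)q;
  (void)query; (void)key; (void)value; (void)grad_out;
  (void)grad_query; (void)grad_key; (void)grad_value;
  (void)batch; (void)num_heads; (void)seq_len; (void)head_dim;
}

} // namespace at::native::cutlass_sycl
```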
Currently, because

- `libcutlass_kernels.so` is linked into `libtorch_xpu_ops.a`,
- `libtorch_xpu_ops.a` is linked into `libtorch_xpu.so`, and
- `libtorch_xpu.so` compiles/links `aten/src/ATen/native/mkldnn/xpu/detail/Attention.cpp` and `aten/src/ATen/native/mkldnn/xpu/Attention.cpp`,

torch-xpu-ops is exposed to `libtorch_xpu.so`, and `pytorch/aten/src/ATen/native/mkldnn/xpu/Attention.cpp` can call the wrapper functions from `torch-xpu-ops/src/ATen/native/cutlass/*.h` directly, as in the sketch below.
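As a rough illustration only (the header path and symbol names are hypothetical placeholders for the wrappers this PR adds), the call site could look like:

```cpp
// Sketch of a call site in pytorch/aten/src/ATen/native/mkldnn/xpu/Attention.cpp;
// the included header and the functions it declares are hypothetical.
#include <ATen/core/Tensor.h>
#include <ATen/native/cutlass/Attention.h>  // hypothetical wrapper header from torch-xpu-ops

namespace at::native {

// Inside the XPU overrideable SDPA backward path, forward to the CUTLASS wrapper,
// which extracts shapes/pointers and launches the CUTLASS SYCL kernel.
void sdpa_backward_via_cutlass(
    const at::Tensor& query, const at::Tensor& key, const at::Tensor& value,
    const at::Tensor& grad_out, at::Tensor& grad_query, at::Tensor& grad_key,
    at::Tensor& grad_value) {
  cutlass_sdpa_backward(query, key, value, grad_out,
                        grad_query, grad_key, grad_value);
}

} // namespace at::native
```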
I have verified that `overrideable_sdpa_backward` can now call into YuanKun's CUTLASS SDPA backward kernel, and it passes a few accuracy UTs.