Conversation


@LuFinch LuFinch commented Aug 27, 2025

This is a draft PR to enable CUTLASS in torch-xpu-ops so that we can test the CUTLASS kernels' accuracy and performance in PyTorch once the SDPA/GEMM kernels are ready.

Since there is no settled plan yet for how to import the cutlass-sycl repo, I download it in CMake for debugging convenience.
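The configure-time download could look roughly like the sketch below. This is only an illustration of the approach, not the actual CMake in this PR; the repository URL and tag are placeholders.

```cmake
include(FetchContent)

FetchContent_Declare(
  cutlass_sycl
  GIT_REPOSITORY https://github.com/example/cutlass-sycl.git  # placeholder URL
  GIT_TAG        main                                         # placeholder tag
)

# Populate only downloads the sources; we just need the headers here,
# not a sub-build of the cutlass-sycl project itself.
FetchContent_Populate(cutlass_sycl)

include_directories(${cutlass_sycl_SOURCE_DIR}/include)
```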

I put all PyTorch/ATen wrapper functions in ATen/native/cutlass/*.cpp; they extract the problem shapes and device pointers from at::Tensor. These files are compiled with pure gcc and linked into libtorch_xpu_ops.a. They call the kernel launch functions from ATen/native/cutlass/sycl/*.cpp.
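The wrapper pattern can be sketched as follows. This is a self-contained illustration, not the code in this PR: the `Tensor` struct stands in for `at::Tensor`, and `sdpa_backward_launch` is a hypothetical name for a launch entry point that would be compiled with icpx into libcutlass_kernels.so. Here it is given a stub definition so the sketch links on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for at::Tensor; the real wrappers would use the
// actual ATen API (tensor.sizes(), tensor.data_ptr<float>()).
struct Tensor {
  std::vector<int64_t> sizes;
  float* data;
};

// Declaration of a kernel-launch entry point that would live in
// ATen/native/cutlass/sycl/*.cpp (icpx-compiled side). The name and
// signature are illustrative only.
void sdpa_backward_launch(const float* grad_out, float* grad_q,
                          int64_t batch, int64_t seq_len, int64_t head_dim);

// Wrapper on the gcc-compiled side: extract the problem shape and raw
// pointers from the tensors and forward them across the library boundary.
void sdpa_backward(const Tensor& grad_out, Tensor& grad_q) {
  assert(grad_out.sizes.size() == 3);
  sdpa_backward_launch(grad_out.data, grad_q.data,
                       grad_out.sizes[0], grad_out.sizes[1],
                       grad_out.sizes[2]);
}

// Stub so this sketch links standalone; the real symbol would come from
// libcutlass_kernels.so.
void sdpa_backward_launch(const float* grad_out, float* grad_q,
                          int64_t batch, int64_t seq_len, int64_t head_dim) {
  for (int64_t i = 0; i < batch * seq_len * head_dim; ++i)
    grad_q[i] = grad_out[i];  // placeholder copy instead of a SYCL kernel
}
```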

I put all CUTLASS SYCL kernel functions in ATen/native/cutlass/sycl/*.cpp. Since CUTLASS and syclcompat don't support -fsycl-host-compiler=g++, I compile the kernel .cpp files into a libcutlass_kernels.so library with pure icpx and then link it into libtorch_xpu_ops.a with the gcc linker.
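The two-compiler split could be expressed in CMake roughly as below. Target names and flags are illustrative; in practice the icpx-built library would likely need its own sub-build or ExternalProject, since a compiler cannot be switched mid-configure.

```cmake
# SYCL kernel sources cannot use -fsycl-host-compiler=g++, so they are
# built into their own shared library with icpx (in a separate sub-build
# where CMAKE_CXX_COMPILER=icpx is set before project()).
file(GLOB CUTLASS_KERNEL_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/sycl/*.cpp)
add_library(cutlass_kernels SHARED ${CUTLASS_KERNEL_SRCS})
target_compile_options(cutlass_kernels PRIVATE -fsycl)
target_link_options(cutlass_kernels PRIVATE -fsycl)

# The gcc-built static library then links against the icpx-built one.
target_link_libraries(torch_xpu_ops PRIVATE cutlass_kernels)
```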

Currently, because

  1. libcutlass_kernels.so links to libtorch_xpu_ops.a
  2. libtorch_xpu_ops.a links to libtorch_xpu.so
  3. libtorch_xpu.so compiles/links aten/src/ATen/native/mkldnn/xpu/detail/Attention.cpp and aten/src/ATen/native/mkldnn/xpu/Attention.cpp
  4. the include directory of torch-xpu-ops is exposed to libtorch_xpu.so.

PyTorch's aten/src/ATen/native/mkldnn/xpu/Attention.cpp can call the wrapper functions from torch-xpu-ops/src/ATen/native/cutlass/*.h directly, like below:

```cpp
#include <ATen/native/cutlass/Attention.h>
at::native::cutlass_sycl::sdpa_backward(...);
```

I have verified that overrideable_sdpa_backward can call into YuanKun's CUTLASS SDPA backward kernel now. It passes a few accuracy UTs.


LuFinch commented Aug 27, 2025

@EikanWang Could you help take a look? I have verified that the ATen .cpp files in PyTorch can invoke functions from torch-xpu-ops' ATen .cpp files.

The issues that need discussion are:

  1. Where should we put the CUTLASS kernels: PyTorch or torch-xpu-ops?
