[Misc] Mooncake EP & Mooncake Backend #805
Conversation
Force-pushed from 82b0d6c to f30e29d.
```yaml
- name: Install CUDA Toolkit
  uses: Jimver/[email protected]
  with:
    cuda: '12.8.1'
    linux-local-args: '["--toolkit"]'
    method: 'network'
    sub-packages: '["nvcc", "nvrtc-dev"]'
    non-cuda-sub-packages: '["libcusparse-dev", "libcublas-dev", "libcusolver-dev"]'
```
@xiaguan do you have time to check on this? Do you know if this is supported on our CI machine?
https://github.com/kvcache-ai/Mooncake/actions/runs/17720954259/job/50353039158?pr=805
It compiles successfully in CI, but I'm not sure if the .whl package will actually work for users.
I think users would usually have the full toolkit installed. I tested the .whl in the SGLang Docker environment, and it worked :)
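For anyone checking a locally built wheel, a minimal smoke test could look like the following (a sketch: the `mooncake.ep` import path is taken from the extension name in the setup.py snippet below, and a working CUDA runtime is assumed):

```python
# Minimal smoke test for an installed Mooncake wheel built with BUILD_WITH_EP=1.
# Assumes a CUDA-capable machine with a driver and toolkit installed.
import torch

assert torch.cuda.is_available(), "CUDA runtime not available"

import mooncake.ep  # the CUDAExtension built below; the import fails if the wheel is broken

print("mooncake.ep imported successfully")
```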
Amazing work!
```python
if int(os.getenv("BUILD_WITH_EP", "0")):
    import torch
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    abi_flag = int(torch._C._GLIBCXX_USE_CXX11_ABI)
    current_dir = os.path.abspath(os.path.dirname(__file__))
    ext_modules = [
        CUDAExtension(
            name="mooncake.ep",
            include_dirs=[
                os.path.join(current_dir, "../mooncake-ep/include"),
                os.path.join(current_dir, "../mooncake-transfer-engine/include"),
            ],
            sources=["../mooncake-integration/ep/ep_py.cpp"],
            extra_compile_args={
                "cxx": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
                "nvcc": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
            },
            libraries=["ibverbs", "mlx5"],
            extra_objects=[
                os.path.join(current_dir, "../build/mooncake-ep/src/libmooncake_ep.a"),
                os.path.join(current_dir, "mooncake/engine.so"),
            ],
        )
    ]
    setup(
        distclass=BinaryDistribution,
        cmdclass={
            "bdist_wheel": CustomBdistWheel,
            "build_ext": BuildExtension,
        },
        ext_modules=ext_modules,
    )
else:
    setup(
        distclass=BinaryDistribution,
        cmdclass={"bdist_wheel": CustomBdistWheel},
    )
```
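Design note: the `torch` import and the CUDA build machinery are only pulled in when `BUILD_WITH_EP=1` is set, so a default wheel build does not require PyTorch at all. When the EP extension is built, the C++ ABI flag is read from the installed torch (`torch._C._GLIBCXX_USE_CXX11_ABI`) so that the extension matches its ABI, which is also why the strict torch version check discussed later exists.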
Is `-std=c++20` the minimum required version? cc: @xiaguan
Mooncake Store needs C++20; the other components could probably use a lower C++ standard like C++17.
It seems that a C++20 feature is used here (`starts_with`):

```cpp
if (server_name.starts_with("[")) {
```
```python
def dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor, broken_ranks: torch.Tensor,
             num_max_dispatch_tokens_per_rank: int, num_experts: int, timeout_us: int,
             use_fp8: bool = True, async_finish: bool = False, return_recv_hook: bool = False) -> \
        Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor, Tuple, EventOverlap, Callable]:
```
This should be fixed as well.
Changed `Tuple[torch.Tensor, torch.Tensor]` to `Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]`.
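For illustration, here is a hedged sketch of how a caller might handle the updated `Union` return type (the buffer object, argument values, and the exact meaning of the tuple elements are assumptions, not taken from the actual tests):

```python
# Hypothetical usage; `buffer` is assumed to be an initialized Mooncake EP
# buffer object, and x / topk_idx / broken_ranks are prepared elsewhere.
recv, recv_count, handle, event, hook = buffer.dispatch(
    x, topk_idx, broken_ranks,
    num_max_dispatch_tokens_per_rank=128,  # illustrative value
    num_experts=64,                        # illustrative value
    timeout_us=1_000_000,                  # illustrative value
    use_fp8=True,
)

if isinstance(recv, tuple):
    recv_x, recv_scales = recv  # use_fp8=True: presumably a (tensor, scales) pair
else:
    recv_x = recv               # use_fp8=False: a single tensor
```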
I have another urgent PR that needs testing and review today; I will continue with this PR tomorrow. @alogfans Please take some time to review this PR as well.
```cpp
TORCH_CHECK(tensorSize * meta->size < kBufferSize, "Too large!");
auto future = c10::make_intrusive<c10::ivalue::Future>(
    c10::ListType::create(c10::TensorType::get()));
// Assumption (per the review below): CPU tasks alternate between two
// task slots, so the slot id is the running task count modulo 2.
int taskId = cpuTaskCount % 2;
```
Maybe add a comment here for clarification?
A comment is added.
.attr("__version__") | ||
.attr("split")("+") | ||
.cast<std::vector<std::string>>()[0]; | ||
TORCH_CHECK(version == "2.8.0", "Mooncake Backend requires torch==2.8.0"); |
Should we use `>=` in case SGLang/vLLM requires a newer version of PyTorch?
I'm afraid a strict equality check is required here, as the Mooncake lib must match the libtorch C++ ABI.
If SGLang/vLLM require a newer version of PyTorch, we would have to recompile Mooncake against the corresponding PyTorch version. (Or, to be optimistic, we might figure out a better solution in future versions.)
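As a user-side illustration, a pre-import check could mirror the C++ guard above (a sketch; only the `2.8.0` requirement and the `+`-suffix stripping are taken from the snippet):

```python
import torch

# Mirror of the C++ check: Mooncake is built against a specific libtorch
# C++ ABI, so the torch version must match exactly.
REQUIRED = "2.8.0"
installed = torch.__version__.split("+")[0]  # drop local suffixes like "+cu128"
if installed != REQUIRED:
    raise RuntimeError(
        f"Mooncake Backend requires torch=={REQUIRED}, found {installed}; "
        "rebuild Mooncake against the installed PyTorch version."
    )
```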
This is a huge PR. I have finished several rounds of basic review and flagged some easy-to-fix problems. I think we can merge this first, after addressing the above comments, to see if we can get some user feedback. CC: @alogfans, better take a look before merging this PR.
@ShangmingCai Thanks for your review and valuable feedback! I'll fix the issues.
I agree with @ShangmingCai; merge it first.
In this PR, we propose Mooncake EP and the Mooncake Backend.
Mooncake EP is an adaptation of DeepEP that supports fault tolerance for large-scale MoE inference. It remains API-compatible with DeepEP, with an extra `broken_ranks` tensor to track failed ranks.

Mooncake Backend is a PyTorch distributed backend, designed as a fault-tolerant replacement for NCCL and Gloo. It can continue to perform collective communication under rank failures and reports them to upper layers for graceful handling.
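As a rough sketch of what "a fault-tolerant replacement for NCCL and Gloo" means in practice, usage would presumably follow the standard `torch.distributed` flow (the backend registration name and import path here are assumptions; see doc/en/ep-backend.md for the real usage):

```python
import os
import torch
import torch.distributed as dist

import mooncake  # assumed to register the custom backend on import

# Standard torch.distributed initialization, with the Mooncake backend
# swapped in for "nccl"/"gloo" (the backend name is an assumption).
dist.init_process_group(
    backend="mooncake",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

t = torch.ones(1024, device="cuda")
dist.all_reduce(t)  # keeps running under rank failures and reports them upward
```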
Read more at doc/en/ep-backend.md.

Tests
Since the C++ APIs are not intended for direct use, no C++ unit tests are provided. Instead, three Python unit tests are included under mooncake-wheel/tests/:

Performance
Tested on an 8 × H100 node.
Mooncake EP (pure RDMA)
Mooncake Backend
Here are the preliminary performance results for the Mooncake Backend; further optimizations are planned.
All data are in microseconds.
Mooncake vs. Gloo
Allgather
Allreduce
Broadcast
Mooncake vs. NCCL
Allgather
Allreduce
Broadcast