Conversation

UNIDY2002
Contributor

In this PR, we propose Mooncake EP and the Mooncake Backend.

Mooncake EP is an adaptation of DeepEP that supports fault tolerance for large-scale MoE inference. It remains API-compatible with DeepEP, with an extra broken_ranks tensor to track failed ranks.
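
For illustration, here is a minimal sketch of how a DeepEP-style low-latency dispatch call looks with the extra broken_ranks tensor. The buffer object, placeholder variables, tensor dtype/shape, and argument values below are assumptions made for the sketch; the actual API is documented in doc/en/ep-backend.md.

```python
import torch

# Sketch only: `ep_buffer`, `world_size`, `x`, and `topk_idx` are placeholders,
# and the dtype/shape chosen for `broken_ranks` is an assumption, not the documented API.
broken_ranks = torch.zeros(world_size, dtype=torch.int32, device="cuda")

# Same call shape as DeepEP's low-latency dispatch, plus the `broken_ranks` tensor.
packed_recv_x, packed_recv_count, handle, event, hook = ep_buffer.dispatch(
    x, topk_idx, broken_ranks,
    num_max_dispatch_tokens_per_rank=128,
    num_experts=256,
    timeout_us=10_000,
)

# Peers that fail or time out are flagged in `broken_ranks`, so the caller can
# drop or reroute their traffic instead of hanging the whole job.
```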

Mooncake Backend is a PyTorch distributed backend designed as a fault-tolerant replacement for NCCL and Gloo. It can continue to perform collective communication under rank failures and reports those failures to upper layers for graceful handling.
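
A minimal usage sketch, assuming the backend is registered under the name "mooncake" and selected through the standard torch.distributed entry point (both the backend string and the import-time registration are assumptions here; see doc/en/ep-backend.md for the actual setup):

```python
import os
import torch
import torch.distributed as dist
import mooncake  # assumption: importing the package registers the backend with PyTorch

# Standard torch.distributed initialization; only the backend name differs.
dist.init_process_group(
    backend="mooncake",  # assumed backend string
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

t = torch.ones(1024)
dist.all_reduce(t)  # per the PR description, this keeps working under rank failures
                    # and reports the failed ranks to the caller for graceful handling
```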

Read more at doc/en/ep-backend.md.


Tests

Since the C++ APIs are not intended for direct use, no C++ unit tests are provided. Instead, three Python unit tests are included under mooncake-wheel/tests/:

  • test_mooncake_ep.py: Adapted from DeepEP’s test_low_latency.py. Verifies the correctness of the EP APIs and includes a basic performance test.
  • test_mooncake_backend.py: Validates the correctness of the Mooncake Backend.
  • test_mooncake_backend_perf.py: Compares the performance of the Mooncake Backend against NCCL and Gloo.

Performance

Tested on a single node with 8× H100 GPUs.

Mooncake EP (pure RDMA)

| Impl | Dispatch bandwidth | Dispatch latency | Combine bandwidth | Combine latency |
| --- | --- | --- | --- | --- |
| Mooncake | 41 GB/s | 184 us | 38 GB/s | 387 us |
| DeepEP | 46 GB/s | 163 us | 46 GB/s | 318 us |

Mooncake Backend

Here are the preliminary performance results for the Mooncake Backend; further optimizations are planned.

All data are in microseconds.

Mooncake vs. Gloo

Allgather

| Data Size | Mooncake | Gloo |
| --- | --- | --- |
| 1K | 94 | 681 |
| 4K | 125 | 834 |
| 16K | 288 | 1121 |
| 64K | 928 | 6253 |
| 256K | 3715 | 8163 |
| 1M | 7929 | 37067 |
| 4M | 31239 | 142334 |

Allreduce

| Data Size | Mooncake | Gloo |
| --- | --- | --- |
| 1K | 87 | 1334 |
| 4K | 163 | 1358 |
| 16K | 476 | 1482 |
| 64K | 1623 | 1606 |
| 256K | 6382 | 2202 |
| 1M | 23194 | 5324 |
| 4M | 92664 | 15734 |

Broadcast

| Data Size | Mooncake | Gloo |
| --- | --- | --- |
| 1K | 61 | 101 |
| 4K | 87 | 129 |
| 16K | 142 | 177 |
| 64K | 389 | 449 |
| 256K | 1389 | 1130 |
| 1M | 1662 | 2759 |
| 4M | 7876 | 11559 |

Mooncake vs. NCCL

Allgather

| Data Size | Mooncake | NCCL |
| --- | --- | --- |
| 1K | 67 | 93 |
| 4K | 69 | 88 |
| 16K | 78 | 93 |
| 64K | 122 | 84 |
| 256K | 293 | 81 |
| 1M | 1038 | 178 |
| 4M | 4158 | 521 |

Allreduce

| Data Size | Mooncake | NCCL |
| --- | --- | --- |
| 1K | 57 | 34 |
| 4K | 60 | 30 |
| 16K | 77 | 31 |
| 64K | 122 | 30 |
| 256K | 300 | 31 |
| 1M | 1112 | 53 |
| 4M | 14421 | 119 |

Broadcast

| Data Size | Mooncake | NCCL |
| --- | --- | --- |
| 1K | 50 | 28 |
| 4K | 38 | 26 |
| 16K | 47 | 27 |
| 64K | 100 | 28 |
| 256K | 246 | 34 |
| 1M | 834 | 28 |
| 4M | 3196 | 68 |

@UNIDY2002 force-pushed the sunxun/mooncake-backend-dev branch from 82b0d6c to f30e29d on September 15, 2025 03:08
Comment on lines +334 to +341
- name: Install CUDA Toolkit
  uses: Jimver/[email protected]
  with:
    cuda: '12.8.1'
    linux-local-args: '["--toolkit"]'
    method: 'network'
    sub-packages: '["nvcc", "nvrtc-dev"]'
    non-cuda-sub-packages: '["libcusparse-dev", "libcublas-dev", "libcusolver-dev"]'
Collaborator

@xiaguan do you have time to check on this? Do you know if this is supported on our CI machine?

Collaborator

https://github.com/kvcache-ai/Mooncake/actions/runs/17720954259/job/50353039158?pr=805

It compiles successfully in CI, but I'm not sure if the .whl package will actually work for users.

Contributor Author

I think users usually have the full toolkit installed. I tested the .whl in the SGLang Docker environment, and it works :)

@whybeyoung

Amazing work!

Comment on lines +112 to +148
if int(os.getenv("BUILD_WITH_EP", "0")):
    import torch
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    abi_flag = int(torch._C._GLIBCXX_USE_CXX11_ABI)
    current_dir = os.path.abspath(os.path.dirname(__file__))
    ext_modules = [
        CUDAExtension(
            name="mooncake.ep",
            include_dirs=[
                os.path.join(current_dir, "../mooncake-ep/include"),
                os.path.join(current_dir, "../mooncake-transfer-engine/include"),
            ],
            sources=["../mooncake-integration/ep/ep_py.cpp"],
            extra_compile_args={
                "cxx": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
                "nvcc": [f"-D_GLIBCXX_USE_CXX11_ABI={abi_flag}", "-std=c++20"],
            },
            libraries=["ibverbs", "mlx5"],
            extra_objects=[
                os.path.join(current_dir, "../build/mooncake-ep/src/libmooncake_ep.a"),
                os.path.join(current_dir, "mooncake/engine.so"),
            ],
        )
    ]
    setup(
        distclass=BinaryDistribution,
        cmdclass={
            "bdist_wheel": CustomBdistWheel,
            "build_ext": BuildExtension,
        },
        ext_modules=ext_modules,
    )
else:
    setup(
        distclass=BinaryDistribution,
        cmdclass={"bdist_wheel": CustomBdistWheel},
    )
Collaborator

Is -std=c++20 the minimum required version? cc: @xiaguan

Collaborator

Mooncake Store needs C++20; the other components could probably use a lower C++ standard such as C++17.

Contributor Author

It seems that a C++20 feature (std::string::starts_with) is used here:

if (server_name.starts_with("[")) {

def dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor, broken_ranks: torch.Tensor,
             num_max_dispatch_tokens_per_rank: int, num_experts: int, timeout_us: int,
             use_fp8: bool = True, async_finish: bool = False, return_recv_hook: bool = False) -> \
        Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor, Tuple, EventOverlap, Callable]:
Collaborator

This should be fixed as well.

Contributor Author

Changed Tuple[torch.Tensor, torch.Tensor] to Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor]
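
For reference, the updated annotation would then read roughly as follows (a sketch based on the change described above):

```python
def dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor, broken_ranks: torch.Tensor,
             num_max_dispatch_tokens_per_rank: int, num_experts: int, timeout_us: int,
             use_fp8: bool = True, async_finish: bool = False, return_recv_hook: bool = False) -> \
        Tuple[Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor], torch.Tensor, Tuple, EventOverlap, Callable]:
    # Presumably, with use_fp8=True the first element is a (data, scales) tuple,
    # otherwise a single tensor, hence the Union.
    ...
```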

@ShangmingCai
Collaborator

I have another urgent PR to test and review today; I will continue with this PR tomorrow.

@alogfans Please take some time to review this PR as well.

TORCH_CHECK(tensorSize * meta->size < kBufferSize, "Too large!");
auto future = c10::make_intrusive<c10::ivalue::Future>(
    c10::ListType::create(c10::TensorType::get()));
int taskId = cpuTaskCount % 2;
Collaborator

Maybe add a comment here for clarification?

Contributor Author

A comment is added.

.attr("__version__")
.attr("split")("+")
.cast<std::vector<std::string>>()[0];
TORCH_CHECK(version == "2.8.0", "Mooncake Backend requires torch==2.8.0");
Collaborator

Should we use >= in case SGLang/vLLM require a newer version of PyTorch?

Contributor Author

I'm afraid a strict equality check is required here, as the Mooncake library must match the libtorch C++ ABI.

If SGLang/vLLM require a newer version of PyTorch, we may have to recompile Mooncake against the corresponding PyTorch version. (Or, to be optimistic, we might figure out a better solution in a future release.)

Collaborator

@ShangmingCai left a comment

This is a huge PR. I have finished several rounds of basic review and found some easy-to-fix problems. I think we can merge this first, after addressing the above comments, to see if we can get some user feedback. CC: @alogfans, please take a look before merging this PR.

@UNIDY2002
Contributor Author

@ShangmingCai Thanks for your review and valuable feedback! I'll fix the issues.

@alogfans
Collaborator

I agree with @ShangmingCai; let's merge it first.

@alogfans merged commit c5829aa into main on Sep 26, 2025
13 checks passed