@njhill njhill commented Aug 14, 2025

FIX #22831
FIX #22839

  • Ensure async zmq sockets are closed from the event loop
  • Set LINGER=0 on stats update sockets to ensure they don't block the context being closed

Signed-off-by: Nick Hill <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a potential hanging issue when closing asynchronous ZMQ sockets by ensuring they are closed from within their event loop. The changes introduce a close_sockets helper function and modify the BackgroundResources finalizer to use loop.call_soon_threadsafe for closing async sockets and cancelling related tasks. This is the correct approach for thread-safe cleanup of asyncio resources. Additionally, linger=0 is set on stats update sockets to prevent them from blocking context termination. The changes appear correct and well-implemented to fix the described bug.
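The pattern described in this review can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `FakeAsyncSocket` is a hypothetical stand-in for an asyncio-bound socket such as `zmq.asyncio.Socket`, `shutdown_from_other_thread` is a hypothetical name, and the body of `close_sockets` is an assumption (only its name appears in the review). The point it tries to show is that a thread other than the loop's own hands the cleanup to the loop with `loop.call_soon_threadsafe` instead of closing the sockets directly.

```python
import asyncio
import threading


class FakeAsyncSocket:
    """Hypothetical stand-in for an asyncio-bound socket (not a real zmq socket)."""

    def __init__(self):
        self.closed = False
        self.closed_on_thread = None

    def close(self, linger=None):
        # Record which thread performed the close so the demo can check it.
        self.closed = True
        self.closed_on_thread = threading.get_ident()


def close_sockets(sockets):
    # Assumed helper body: close every open socket without lingering.
    for sock in sockets:
        if sock is not None and not sock.closed:
            sock.close(linger=0)


def shutdown_from_other_thread(loop, sockets, tasks):
    """Hand cleanup to the loop's thread instead of closing directly."""

    def _cleanup():
        # Runs on the event loop thread.
        for task in tasks:
            task.cancel()
        close_sockets(sockets)

    if loop.is_closed():
        # No loop left to hand off to, so close directly as a fallback.
        close_sockets(sockets)
    else:
        loop.call_soon_threadsafe(_cleanup)


def demo():
    # Run an event loop in a background thread.
    loop = asyncio.new_event_loop()
    loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
    loop_thread.start()

    async def _spawn():
        # A long-running task standing in for a socket-reading coroutine.
        return asyncio.create_task(asyncio.sleep(3600))

    task = asyncio.run_coroutine_threadsafe(_spawn(), loop).result(timeout=5)
    sock = FakeAsyncSocket()

    # Called from the main thread, as a finalizer might be.
    shutdown_from_other_thread(loop, [sock], [task])

    # Wait on the loop until the cancelled task has actually finished.
    asyncio.run_coroutine_threadsafe(asyncio.wait([task]), loop).result(timeout=5)

    result = {
        "socket_closed": sock.closed,
        "closed_on_loop_thread": sock.closed_on_thread == loop_thread.ident,
        "task_cancelled": task.cancelled(),
    }

    loop.call_soon_threadsafe(loop.stop)
    loop_thread.join(timeout=5)
    loop.close()
    return result


if __name__ == "__main__":
    print(demo())
```

In this sketch the close call always runs on the loop thread while the loop is alive, which is the property the review calls out. With real pyzmq sockets, `linger=0` on close additionally discards unsent messages so that terminating the context does not wait on them.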

@Isotr0py Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 14, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@Isotr0py Isotr0py enabled auto-merge (squash) August 14, 2025 05:18
njhill and others added 2 commits August 13, 2025 23:21
As @Isotr0py did in his original PR

Signed-off-by: Nick Hill <[email protected]>

Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
@Isotr0py Isotr0py disabled auto-merge August 14, 2025 08:16
@Isotr0py Isotr0py enabled auto-merge (squash) August 14, 2025 08:16
@DarkLight1337

Unblocking the 4 GPUs test

@DarkLight1337

Passed as well, merging

@vllm-bot vllm-bot merged commit eb08487 into vllm-project:main Aug 14, 2025
39 of 42 checks passed
@njhill njhill deleted the fix-dist-test branch August 14, 2025 13:20
@njhill
njhill commented Aug 14, 2025

@Isotr0py as you probably saw, I incorporated your changes for explicitly closing the other sockets after all, because they did appear to be needed. I wasn't able to reproduce the hang locally, but the CI test was still hanging without them.

@Isotr0py

Yes, I can't reproduce the hang locally either, but it does get stuck at gc.collect for about 3 to 5 minutes:

Current thread 0x0000792424c13740 (most recent call first):
  Garbage-collecting
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 264 in term
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 322 in destroy
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/zmq/sugar/context.py", line 140 in __del__
  File "/kaggle/working/vllm/vllm/distributed/parallel_state.py", line 1277 in cleanup_dist_env_and_memory
  File "/kaggle/working/vllm/tests/conftest.py", line 199 in cleanup_fixture
  File "/kaggle/working/vllm/.venv/lib/python3.12/site-packages/_pytest/fixtures.py", line 907 in _teardown_yield_fixture

I wonder if there is an issue on the ZMQ side...

Successfully merging this pull request may close these issues.

[CI Failure][NIGHTLY FIRE DRILL]: Distributed Tests (2 GPUS)
[CI Failure][NIGHTLY FIRE DRILL]: Distributed Tests (4GPUs)