Conversation

Contributor

@TheEpicDolphin TheEpicDolphin commented Aug 4, 2025

Purpose

Fix the following "out of resource" error, which occurs during execution of the kernel_unified_attention_2d kernel:

[2025-08-04T06:55:28Z] v1/spec_decode/test_tree_attention.py:110: in forward_attention
[2025-08-04T06:55:28Z]     return instance.forward(
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/tree_attn.py:432: in forward
[2025-08-04T06:55:28Z]     unified_attention(
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/vllm/attention/ops/triton_unified_attention.py:664: in unified_attention
[2025-08-04T06:55:28Z]     kernel_unified_attention_2d[(
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py:347: in <lambda>
[2025-08-04T06:55:28Z]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py:591: in run
[2025-08-04T06:55:28Z]     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata,
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py:413: in __getattribute__
[2025-08-04T06:55:28Z]     self._init_handles()
[2025-08-04T06:55:28Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-04T06:55:28Z]
[2025-08-04T06:55:28Z] self = <triton.compiler.compiler.CompiledKernel object at 0x7f4cedee5df0>
[2025-08-04T06:55:28Z]
[2025-08-04T06:55:28Z]     def _init_handles(self):
[2025-08-04T06:55:28Z]         if self.module is not None:
[2025-08-04T06:55:28Z]             return
[2025-08-04T06:55:28Z]         device = driver.active.get_current_device()
[2025-08-04T06:55:28Z]         # create launcher
[2025-08-04T06:55:28Z]         self.run = driver.active.launcher_cls(self.src, self.metadata)
[2025-08-04T06:55:28Z]         # not enough shared memory to run the kernel
[2025-08-04T06:55:28Z]         max_shared = driver.active.utils.get_device_properties(device)["max_shared_mem"]
[2025-08-04T06:55:28Z]         if self.metadata.shared > max_shared:
[2025-08-04T06:55:28Z] >           raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
[2025-08-04T06:55:28Z] E           triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 155648, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
[2025-08-04T06:55:28Z]
[2025-08-04T06:55:28Z] /usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py:401: OutOfResources
[2025-08-04T06:55:28Z] =============================== warnings summary ===============================
[2025-08-04T06:55:28Z] ../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
[2025-08-04T06:55:28Z]   /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
[2025-08-04T06:55:28Z]     ref_error: type[Exception] = jsonschema.RefResolutionError,
[2025-08-04T06:55:28Z]
[2025-08-04T06:55:28Z] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[2025-08-04T06:55:28Z] =========================== short test summary info ============================
[2025-08-04T06:55:28Z] FAILED v1/spec_decode/test_tree_attention.py::test_tree_attn_correctness - triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 155648, Hardware limit: 101376. Reducing block sizes or `num_stages` may help.
[2025-08-04T06:55:28Z] ============= 1 failed, 24 passed, 1 warning in 233.80s (0:03:53) ==============
[2025-08-04T06:55:31Z] 🚨 Error: The command exited with status 1

It is currently triggered by the test_tree_attention test and is caused by the large block size (128) used there. I have reduced the block size to 32, which brings the per-block shared memory usage down to roughly 38 KB, within the shared-memory limits of most modern hardware.
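For a quick sanity check on that ~38 KB figure: the kernel's shared-memory footprint scales roughly linearly with the block size, so scaling down the measurement from the failing run gives a close estimate. This is illustrative arithmetic only; the authoritative number is the `metadata.shared` value reported by the Triton compiler.

```python
# Back-of-the-envelope check using the numbers from the error above.
# Shared-memory usage grows roughly linearly with the block size, so scaling
# the failing measurement from block size 128 down to 32 approximates the new
# requirement. This is an estimate, not the compiler-reported value.
required_at_block_128 = 155648   # bytes, "Required" in the OutOfResources error
hardware_limit = 101376          # bytes, "Hardware limit" in the same error

required_at_block_32 = required_at_block_128 * 32 // 128
print(f"~{required_at_block_32 / 1024:.0f} KB")   # ~38 KB
assert required_at_block_32 < hardware_limit      # fits comfortably within the limit
```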

Test Plan

(py312conda) bash-5.1$ pytest tests/v1/spec_decode/test_tree_attention.py -k test_tree_attn_correctness
============================================================================================================================================ test session starts ============================================================================================================================================
platform linux -- Python 3.12.9, pytest-8.4.1, pluggy-1.6.0
rootdir: /data/users/gdelfin/gitrepos/vllm
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-1.1.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item

tests/v1/spec_decode/test_tree_attention.py . [100%]

============================================================================================================================================ 1 passed in 34.33s =============================================================================================================================================

…t of resource' triton error

Signed-off-by: Giancarlo Delfin <[email protected]>

github-actions bot commented Aug 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an 'out of resource' error in the test_tree_attn_correctness test by reducing the block_size from 128 to 32. While this fixes the immediate issue on hardware with limited shared memory, it also reduces test coverage for a valid and important configuration. My review includes a suggestion to conditionally set the block_size based on available GPU resources. This approach maintains test coverage on capable hardware while ensuring the test suite remains stable on more constrained environments, thus improving the overall robustness of the tests.
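A rough sketch of that conditional approach follows. The helper name and the 155648-byte threshold (taken from the error above) are illustrative, not part of vLLM, and the import path for the Triton driver object may vary between Triton versions; it reuses the same driver query that appears in the traceback to read the device's shared-memory limit.

```python
# Illustrative sketch only: choose the test's block size from the device's
# shared-memory capacity instead of hard-coding it. The 155648-byte threshold
# comes from the "Required" value in the error above.
from triton.runtime import driver  # import path may differ between Triton versions

def pick_test_block_size(large: int = 128, small: int = 32) -> int:
    device = driver.active.get_current_device()
    max_shared = driver.active.utils.get_device_properties(device)["max_shared_mem"]
    # Only exercise the large block size on GPUs whose shared memory can hold
    # what the kernel needed at block size 128 in the failing run.
    return large if max_shared >= 155648 else small
```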

@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review August 4, 2025 18:23

@sgrigory sgrigory left a comment

Stamping to avoid broken CI. Let's follow up with a check that the tree attention backend is not chosen when the block size is too large.
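A minimal version of that follow-up check might look like the sketch below. The function name and the 32-block threshold are illustrative, not part of vLLM's backend-selection code; the device-capability query shown earlier could replace the constant.

```python
# Illustrative guard: skip the tree attention backend when the configured block
# size exceeds what the unified-attention Triton kernel has been validated to
# fit in shared memory. Name and threshold are placeholders, not vLLM APIs.
MAX_VALIDATED_BLOCK_SIZE = 32

def tree_attention_backend_supported(block_size: int) -> bool:
    return block_size <= MAX_VALIDATED_BLOCK_SIZE
```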

Member

@mgoin mgoin left a comment

Ditto

@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed ci-failure Issue about an unexpected test failure in CI labels Aug 4, 2025
@vllm-bot vllm-bot merged commit 5ea71ff into vllm-project:main Aug 5, 2025
27 of 30 checks passed
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
myselvess pushed a commit to myselvess/vllm that referenced this pull request Aug 7, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Labels
ci-failure (Issue about an unexpected test failure in CI), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1
Projects
Status: Done
4 participants