Closed
Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```
Model Input Dumps
No response
🐛 Describe the bug
With the glm-4 model, prefix caching is automatically disabled because the engine treats it as a multimodal LLM. This may be related to the following code:
vllm/vllm/model_executor/models/chatglm.py, lines 758 to 782 in d427e5c:

```python
@MULTIMODAL_REGISTRY.register_image_input_mapper(mm_input_mapper_for_glmv)
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_glmv_image_tokens)
@INPUT_REGISTRY.register_dummy_data(dummy_data_for_glmv)
@INPUT_REGISTRY.register_input_processor(input_processor_for_glmv)
class ChatGLMForCausalLM(ChatGLMBaseModel, SupportsLoRA, SupportsPP,
                         SupportsMultiModal):
    # Ensure that the LoRA support check passes when the class is not
    # initialized, but set all these attributes to empty.
    packed_modules_mapping = {}
    supported_lora_modules = []
    embedding_modules = {}
    embedding_padding_modules = []

    def __new__(
        cls,
        vllm_config: VllmConfig,
        prefix: str = "",
    ) -> None:
        config = vllm_config.model_config.hf_config
        # Initialize VL
        if hasattr(config, "visual"):
            return ChatGLMV(vllm_config=vllm_config, prefix=prefix)
        # Initialize LLM
        else:
            return ChatGLM(vllm_config=vllm_config, prefix=prefix)
```
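Note that the registry decorators above apply to `ChatGLMForCausalLM` unconditionally, so the architecture is flagged as multimodal even when `__new__` returns the text-only `ChatGLM`. The dispatch pattern itself can be illustrated with a minimal, self-contained sketch (the class names here are hypothetical stand-ins, not the actual vLLM classes):

```python
from types import SimpleNamespace

class TextModel:
    """Stand-in for the text-only ChatGLM path."""
    def __init__(self, config):
        self.config = config

class VisionModel:
    """Stand-in for the ChatGLMV vision-language path."""
    def __init__(self, config):
        self.config = config

class DispatchingModel:
    """Mimics ChatGLMForCausalLM.__new__: returns a concrete class
    based on whether the HF config carries a `visual` section."""
    def __new__(cls, config):
        if hasattr(config, "visual"):
            return VisionModel(config)
        return TextModel(config)

# glm-4 and glm-4v share model_type; only the config contents differ.
text_cfg = SimpleNamespace(model_type="chatglm")
vl_cfg = SimpleNamespace(model_type="chatglm", visual={})

assert isinstance(DispatchingModel(text_cfg), TextModel)
assert isinstance(DispatchingModel(vl_cfg), VisionModel)
```

So the runtime dispatch is per-config, but the multimodal registration is per-class, which is where the mismatch comes from.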
Lines 1046 to 1051 in d427e5c:

```python
if (model_config.is_multimodal_model and not envs.VLLM_USE_V1
        and self.enable_prefix_caching):
    logger.warning("--enable-prefix-caching is currently not "
                   "supported for multimodal models in v0 and "
                   "has been disabled.")
    self.enable_prefix_caching = False
```
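Per the condition above, the force-disable only fires on the v0 engine (`not envs.VLLM_USE_V1`); whether the V1 path fully supports the feature is a separate question. A sketch that just reproduces the guard's truth table (the function name is hypothetical):

```python
def prefix_caching_enabled(is_multimodal: bool, use_v1: bool,
                           requested: bool) -> bool:
    """Reproduces the guard above: prefix caching is force-disabled
    only for multimodal models running on the v0 engine."""
    if is_multimodal and not use_v1 and requested:
        return False
    return requested

# glm-4 is (mis)classified as multimodal, so on v0 the request is dropped:
assert prefix_caching_enabled(True, False, True) is False
# Per the condition alone, the V1 engine path would keep it:
assert prefix_caching_enabled(True, True, True) is True
# Text-only models are unaffected either way:
assert prefix_caching_enabled(False, False, True) is True
```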
Unfortunately, the glm-4 and glm-4v models share the same `model_type` value. How can I override this behavior without changing the code?
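Since `model_type` cannot distinguish the two checkpoints, one possible refinement (hypothetical, not current vLLM behavior) would be to use the same signal `__new__` already relies on: the presence of a `visual` section in the HF config, which only the VL checkpoint carries.

```python
from types import SimpleNamespace

def is_actually_multimodal(hf_config) -> bool:
    """Hypothetical check: glm-4 and glm-4v share model_type, so
    inspect the config for the `visual` section instead (the same
    signal ChatGLMForCausalLM.__new__ uses to pick a class)."""
    return hasattr(hf_config, "visual")

glm4 = SimpleNamespace(model_type="chatglm")
glm4v = SimpleNamespace(model_type="chatglm", visual={"hidden_size": 1024})

assert is_actually_multimodal(glm4) is False
assert is_actually_multimodal(glm4v) is True
```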
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.