This guide covers optimization strategies and performance tuning for vLLM V1.

!!! tip
    Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.

## Preemption

Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
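
When this happens, vLLM may preempt requests to free up KV cache space. As a sketch (the values below are illustrative, not recommendations), preemption can often be reduced by giving the KV cache more headroom, for example by raising `--gpu-memory-utilization` or capping `--max-num-seqs`:

```console
# Illustrative values only: more GPU memory for the KV cache and a smaller
# maximum batch size both reduce how often requests are preempted.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64
```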

Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
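
For example, the following sketch (the model name is a placeholder) combines the two strategies; for an MoE model, the expert layers are then sharded across 2 × 2 = 4 GPUs:

```console
# Sketch: 2-way tensor parallelism combined with 2-way data parallelism.
# For an MoE model, expert layers are sharded across 2 x 2 = 4 GPUs.
vllm serve <your-moe-model> --tensor-parallel-size 2 -dp 2
```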

## Input Processing

### Parallel Processing

You can run input processing in parallel via [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
This is useful when input processing (which runs inside the API server)
becomes a bottleneck compared to model execution (which runs inside the engine core)
and you have excess CPU capacity.

```console
# Run 4 API processes and 1 engine core process
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4

# Run 4 API processes and 2 engine core processes
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
```

!!! note
    API server scale-out is only available for online inference.

!!! note
    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
    because it requires a one-to-one correspondence between API and engine core processes.

## Multi-Modal Caching

### Processor Cache

By default, the multi-modal processor cache is enabled to avoid repeatedly processing
the same multi-modal inputs via Hugging Face `AutoProcessor`,
which commonly occurs in multi-turn conversations.

You can adjust the size of the cache via the `VLLM_MM_INPUT_CACHE_GIB` environment variable
(default 4 GiB per API process + 4 GiB per engine core process).
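
For example, to allow a larger cache per process when serving (the value here is only illustrative):

```console
# Allow up to 8 GiB of processed multi-modal inputs to be cached per process
VLLM_MM_INPUT_CACHE_GIB=8 vllm serve Qwen/Qwen2.5-VL-3B-Instruct
```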

If you do not benefit much from the cache, you can disable it completely via `disable_mm_preprocessor_cache`:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          disable_mm_preprocessor_cache=True)
```
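
For online serving, the same setting can be passed on the command line (assuming the usual flag spelling derived from the engine argument name):

```console
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --disable-mm-preprocessor-cache
```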