Commit 8aa1243

KVCrush method for cache eviction [Updated] (#2523)
Creating a new and updated PR for KVCrush, as I was having a tough time resolving merge conflicts on the existing PR (#2211). Please consider this the official PR and ignore the old one.

- I have addressed all the comments apart from a few, for which I have added explanations in the old PR.
- Documentation and an accuracy evaluation on LongBench have been added [here](https://github.com/openvinotoolkit/openvino.genai/blob/kvcrush_updated/site/docs/concepts/optimization-techniques/kvcache-eviction-algorithm.md).
- The KV cache budget is now expressed in blocks, not tokens.
- For the comments in the older PR that needed clarification, I have replied with a comment; the others are marked as resolved after the changes made here.

Co-authored-by: Vladimir Zlobin <[email protected]>
1 parent cf847ad commit 8aa1243

14 files changed

+1038
-26
lines changed


site/docs/concepts/optimization-techniques/kvcache-eviction-algorithm.md

Lines changed: 37 additions & 0 deletions
@@ -60,3 +60,40 @@ It can be enabled by setting the `CacheEvictionConfig.apply_rotation` field to `
6060
* Cache rotation is only targeted for the regular, linear LLaMa-like RoPE application and may degrade accuracy on models that use other RoPE schemes.
6161

6262
* Cache rotation is currently only supported for the models with uniform V embedding sizes across the layers.
63+
64+
## (Optional) KVCrush
65+
66+
KVCrush extends the standard H2O/SnapKV eviction: instead of simply evicting the lowest-scoring blocks, it uses clustering analysis to select the most representative blocks in the evictable area and retains them.
67+
68+
### Algorithm Overview
69+
70+
1. **Indicator Creation**: Generate binary indicators for tokens based on importance scores
71+
2. **Anchor Point Generation**: Create reference patterns using configurable modes
72+
3. **Distance Calculation**: Measure Hamming distance between block patterns and the anchor point
73+
4. **Representative Selection**: Select blocks to best represent context diversity
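
Steps 1 and 3 above can be illustrated with a minimal Python sketch. It is not the library implementation: the function names `create_indicators` and `hamming_distances` are hypothetical, and the sketch assumes one importance score per token, with the top-scoring half of the tokens marked as `1`.

```python
def create_indicators(token_scores):
    """Step 1 (sketch): mark the top-scoring half of the tokens with 1, the rest with 0."""
    num_tokens = len(token_scores)
    # Token indices ranked by descending score.
    ranked = sorted(range(num_tokens), key=lambda i: token_scores[i], reverse=True)
    indicators = [0] * num_tokens
    for i in ranked[: num_tokens // 2]:
        indicators[i] = 1
    return indicators


def hamming_distances(indicators, anchor_point):
    """Step 3 (sketch): Hamming distance between each block's indicators and the anchor."""
    block_size = len(anchor_point)
    num_blocks = len(indicators) // block_size
    distances = []
    for block_idx in range(num_blocks):
        block = indicators[block_idx * block_size : (block_idx + 1) * block_size]
        # Count the positions where the block pattern differs from the anchor.
        distances.append(sum(bit != ref for bit, ref in zip(block, anchor_point)))
    return distances
```

For example, with scores `[0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.6]` and a block size of 4, the indicators are `[1, 0, 1, 0, 0, 1, 0, 1]`; against the anchor `[0, 1, 0, 1]` the two blocks have Hamming distances 4 and 0.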
74+
75+
### Configuration
76+
Set up the KVCrush parameters in a `KVCrushConfig` and pass it to `CacheEvictionConfig`. The following sample code allocates KVCrush a budget of 2 blocks and uses the `MEAN` anchor mode.
77+
```cpp
78+
const ov::genai::CacheEvictionConfig EXAMPLE_CACHE_EVICTION_CONFIG =
79+
{32, 32, 192, ov::genai::AggregationMode::NORM_SUM, false, 8, ov::genai::KVCrushConfig(2, ov::genai::KVCrushAnchorPointMode::MEAN)};
80+
```
81+
```python
82+
CacheEvictionConfig(
83+
start_size=32,
84+
recent_size=128,
85+
max_cache_size=448,
86+
aggregation_mode=AggregationMode.NORM_SUM,
87+
apply_rotation=False,
88+
snapkv_window_size=8,
89+
kvcrush_config=KVCrushConfig(budget=2, anchor_point_mode=KVCrushAnchorPointMode.MEAN)
90+
)
91+
```
92+
93+
**Anchor Point Modes:**
94+
- `RANDOM`: Random binary pattern
95+
- `ZEROS`: All zeros pattern
96+
- `ONES`: All ones pattern
97+
- `MEAN`: Per-position majority vote of the indicators across blocks (1 where the mean exceeds 0.5, otherwise 0)
98+
- `ALTERNATE`: Alternating 0-1 pattern
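
The five modes can be sketched in Python as follows. This is an illustrative sketch only: `create_anchor_point` and its string mode names are hypothetical, and Python's `random` module will not reproduce the exact bit sequence that the C++ implementation draws from its own generator for the same seed.

```python
import random


def create_anchor_point(mode, block_size, indicators=None, seed=0):
    """Build a binary anchor vector of length block_size for the given mode (sketch)."""
    if mode == "RANDOM":
        rng = random.Random(seed)  # seeded so that repeated calls are reproducible
        return [rng.randint(0, 1) for _ in range(block_size)]
    if mode == "ZEROS":
        return [0] * block_size
    if mode == "ONES":
        return [1] * block_size
    if mode == "MEAN":
        # Majority vote per position across all blocks; a tie (mean == 0.5) yields 0.
        num_blocks = len(indicators) // block_size
        anchor = []
        for pos in range(block_size):
            ones = sum(indicators[b * block_size + pos] for b in range(num_blocks))
            anchor.append(1 if ones / num_blocks > 0.5 else 0)
        return anchor
    if mode == "ALTERNATE":
        return [i % 2 for i in range(block_size)]
    raise ValueError(f"Invalid anchor point mode: {mode}")
```

Note that `MEAN` is the only mode that depends on the data: it needs the indicators of at least one full block, whereas the other modes depend only on the block size (and, for `RANDOM`, the seed).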
99+

src/cpp/include/openvino/genai/cache_eviction.hpp

Lines changed: 68 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,22 +19,83 @@ enum class AggregationMode {
1919
* of a given token in cache */
2020
};
2121

22+
/**
23+
* @brief Represents the mode of how anchor points are formed in KVCrush Cache eviction algorithm
24+
*/
25+
enum class KVCrushAnchorPointMode {
26+
RANDOM, /**<In this mode the anchor point is a random binary vector of 0s and 1s */
27+
ZEROS, /**<In this mode the anchor point is a vector of 0s */
28+
ONES, /**<In this mode the anchor point is a vector of 1s */
29+
MEAN, /**<In this mode the anchor point is a binary vector whose value at each position is the majority value of
the indicators at that position across blocks */
31+
ALTERNATE /**<In this mode the anchor point is a vector of alternating 0s and 1s */
32+
};
33+
34+
class KVCrushConfig {
35+
public:
36+
/**
37+
* @brief Configuration struct for the KVCrush cache eviction algorithm.
38+
*/
39+
/**
40+
* @class KVCrushConfig
41+
* @brief Configuration class for KVCrush cache mechanism.
42+
*
43+
* This class encapsulates the configuration parameters for the KVCrush cache,
44+
* including cache budget, anchor point mode, and random seed.
45+
*/
46+
47+
KVCrushConfig() = default;
48+
49+
/**
50+
* @brief Constructs a KVCrushConfig with the specified parameters.
51+
* @param budget_ The cache budget, representing the number of blocks to store.
52+
* @param anchor_point_mode_ The anchor point mode for KVCrush (see KVCrushAnchorPointMode).
53+
* @param rng_seed_ Optional random seed for reproducibility (default is 0).
54+
*/
55+
56+
KVCrushConfig(size_t budget_, KVCrushAnchorPointMode anchor_point_mode_, size_t rng_seed_ = 0)
57+
: budget(budget_),
58+
anchor_point_mode(anchor_point_mode_),
59+
rng_seed(rng_seed_) {}
60+
61+
/*KVCrush Cache budget - number of blocks*/
62+
std::size_t budget = 0;
63+
/*KVCrush Anchor point mode*/
64+
KVCrushAnchorPointMode anchor_point_mode = KVCrushAnchorPointMode::RANDOM;
65+
size_t rng_seed = 0;
66+
std::size_t get_budget() const {
67+
return budget;
68+
}
69+
};
70+
2271
/**
2372
* @brief Configuration struct for the cache eviction algorithm.
2473
*/
2574
class CacheEvictionConfig {
2675
public:
2776
CacheEvictionConfig() = default;
2877

29-
CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_, bool apply_rotation_ = false, size_t snapkv_window_size_ = 8) : aggregation_mode(aggregation_mode_), apply_rotation(apply_rotation_), snapkv_window_size(snapkv_window_size_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) {
78+
CacheEvictionConfig(size_t start_size,
79+
size_t recent_size,
80+
size_t max_cache_size,
81+
AggregationMode aggregation_mode_,
82+
bool apply_rotation_ = false,
83+
size_t snapkv_window_size_ = 8,
84+
const KVCrushConfig& kvcrush_config_ = KVCrushConfig(0, KVCrushAnchorPointMode::RANDOM))
85+
: aggregation_mode(aggregation_mode_),
86+
apply_rotation(apply_rotation_),
87+
snapkv_window_size(snapkv_window_size_),
88+
m_start_size(start_size),
89+
m_recent_size(recent_size),
90+
m_max_cache_size(max_cache_size),
91+
kvcrush_config(kvcrush_config_) {
3092
OPENVINO_ASSERT(start_size, "CacheEvictionConfig.start_size must be non-zero");
3193
OPENVINO_ASSERT(recent_size, "CacheEvictionConfig.recent_size must be non-zero");
3294
OPENVINO_ASSERT(max_cache_size, "CacheEvictionConfig.max_cache_size must be non-zero");
3395

3496
OPENVINO_ASSERT(max_cache_size > (start_size + recent_size),
3597
"CacheEvictionConfig.max_cache_size must be larger than CacheEvictionConfig.start_size + CacheEvictionConfig.recent_size");
3698
m_evictable_size = m_max_cache_size - m_start_size - m_recent_size;
37-
3899
}
39100

40101
/** @return Number of tokens between the "start" and "recent" areas of KV cache that
@@ -76,6 +137,11 @@ class CacheEvictionConfig {
76137
* score aggregation. **/
77138
size_t snapkv_window_size = 8;
78139

140+
/** KVCrush configuration for this cache eviction algorithm.
141+
* KVCrush is an additional mechanism that makes it possible to retain some tokens in the cache
142+
* even if they are not among the most important ones.*/
143+
KVCrushConfig kvcrush_config;
144+
79145
private:
80146
/** Number of tokens in the *beginning* of KV cache that should be retained
81147
* in the KV cache for this sequence during generation. Must be non-zero and a multiple of the KV cache block size for
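
The size checks in the `CacheEvictionConfig` constructor above can be mirrored in a short Python sketch. The helper name `evictable_size` is hypothetical and is not part of the Python bindings; it only reproduces the assertions and the subtraction shown in the diff.

```python
def evictable_size(start_size, recent_size, max_cache_size):
    """Return max_cache_size - start_size - recent_size after the constructor's checks (sketch)."""
    if start_size == 0:
        raise ValueError("start_size must be non-zero")
    if recent_size == 0:
        raise ValueError("recent_size must be non-zero")
    if max_cache_size == 0:
        raise ValueError("max_cache_size must be non-zero")
    if max_cache_size <= start_size + recent_size:
        raise ValueError("max_cache_size must be larger than start_size + recent_size")
    return max_cache_size - start_size - recent_size
```

With the values from the documentation samples, `(32, 32, 192)` leaves 128 evictable tokens and `(32, 128, 448)` leaves 288. The KVCrush budget, by contrast, is counted in blocks rather than tokens.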

src/cpp/src/continuous_batching/cache_eviction.cpp

Lines changed: 33 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -220,7 +220,7 @@ namespace ov::genai {
220220
CacheEvictionAlgorithm::CacheEvictionAlgorithm(const CacheEvictionConfig &eviction_config, size_t block_size,
221221
size_t num_decoder_layers, size_t max_pool_window_size) :
222222
m_eviction_config(eviction_config), m_block_size(block_size), m_num_decoder_layers(num_decoder_layers),
223-
m_score_manager(block_size, num_decoder_layers, max_pool_window_size, eviction_config.aggregation_mode, eviction_config.get_start_size() / block_size, eviction_config.snapkv_window_size)
223+
m_score_manager(block_size, num_decoder_layers, max_pool_window_size, eviction_config.aggregation_mode, eviction_config.get_start_size() / block_size, eviction_config.snapkv_window_size), m_kvcrush_algo(eviction_config.kvcrush_config, block_size)
224224
{
225225
OPENVINO_ASSERT(!(m_eviction_config.get_start_size() % m_block_size),
226226
"CacheEvictionConfig.start_size in tokens must be a multiple of block size ", m_block_size);
@@ -265,6 +265,38 @@ namespace ov::genai {
265265
size_t num_blocks_to_evict = get_num_blocks_to_evict(decoder_layer_idx);
266266
auto evicted_block_indices = get_indices_of_blocks_to_evict(scores_for_all_evictable_blocks, num_blocks_to_evict);
267267

268+
// KVCrush: start
269+
bool should_apply_kvcrush = (m_eviction_config.kvcrush_config.budget > 0) &&
270+
(evicted_block_indices.size() >= m_eviction_config.kvcrush_config.budget);
271+
if (should_apply_kvcrush) {
272+
size_t num_tokens_in_evictable_blocks = scores_for_all_evictable_blocks.size() * m_block_size;
273+
274+
auto kvcrush_retained_block_indices = m_kvcrush_algo.get_indices_of_blocks_to_retain_using_kvcrush(
275+
num_tokens_in_evictable_blocks,
276+
evicted_block_indices,
277+
m_score_manager.get_scores()[decoder_layer_idx]);
278+
279+
// Remove the indices in kvcrush_retained_block_indices from evicted_block_indices
280+
if (!kvcrush_retained_block_indices.empty()) {
281+
// Convert both vectors to sets for efficient operations
282+
std::unordered_set<std::size_t> retained_set(kvcrush_retained_block_indices.begin(),
283+
kvcrush_retained_block_indices.end());
284+
285+
// Create a new vector containing only elements not in retained_set
286+
std::vector<std::size_t> filtered_evicted_indices;
287+
filtered_evicted_indices.reserve(evicted_block_indices.size());
288+
289+
for (const auto& idx : evicted_block_indices) {
290+
if (retained_set.find(idx) == retained_set.end()) {
291+
filtered_evicted_indices.push_back(idx);
292+
}
293+
}
294+
// Replace the original vector with the filtered one
295+
evicted_block_indices = std::move(filtered_evicted_indices);
296+
}
297+
}
298+
// KVCrush: end
299+
268300
m_num_evicted_tokens += evicted_block_indices.size() * m_block_size;
269301

270302
// No longer need to track the overall "heavy-hitter" attention scores for freshly evicted blocks
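
The gating condition and the filtering step in the `KVCrush: start` / `KVCrush: end` section can be summarised in a small Python sketch. The function names are hypothetical; the logic follows the C++ in this hunk: KVCrush runs only when the budget is positive and at least `budget` blocks are slated for eviction, and the blocks it retains are then removed from the eviction list while the order of the remaining indices is preserved.

```python
def should_apply_kvcrush(budget, num_evicted_blocks):
    """KVCrush runs only with a positive budget and enough eviction candidates (sketch)."""
    return budget > 0 and num_evicted_blocks >= budget


def remove_retained(evicted_block_indices, retained_block_indices):
    """Drop the blocks rescued by KVCrush from the eviction list, keeping the original order."""
    retained = set(retained_block_indices)  # set membership makes each lookup O(1) on average
    return [idx for idx in evicted_block_indices if idx not in retained]
```

For instance, if blocks `[5, 2, 7, 3]` were selected for eviction and KVCrush retains `[7, 2]`, only `[5, 3]` are actually evicted.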

src/cpp/src/continuous_batching/cache_eviction.hpp

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
#include "openvino/openvino.hpp"
1212
#include "continuous_batching/attention_output.hpp"
1313
#include "openvino/genai/cache_eviction.hpp"
14+
#include "continuous_batching/kvcrush.hpp"
1415

1516
namespace ov::genai {
1617

@@ -215,6 +216,7 @@ class CacheEvictionAlgorithm {
215216
void remove_scores_of_evicted_blocks(const std::vector<std::size_t>& evicted_block_indices, size_t decoder_layer_idx);
216217

217218
CacheEvictionConfig m_eviction_config;
219+
KVCrushAlgorithm m_kvcrush_algo;
218220
std::size_t m_block_size;
219221
std::size_t m_num_evicted_tokens = 0;
220222
std::size_t m_num_decoder_layers;
Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
1+
// Copyright (C) 2023-2025 Intel Corporation
2+
// SPDX-License-Identifier: Apache-2.0
3+
4+
#include "continuous_batching/kvcrush.hpp"
5+
6+
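#include <algorithm>  // std::partial_sort, std::sort, std::find, std::generate, std::fill, std::min
#include <numeric>    // std::iota
#include <random>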
7+
namespace ov::genai {
8+
9+
KVCrushAlgorithm::KVCrushAlgorithm(const KVCrushConfig& kvcrush_config, size_t block_size)
10+
: m_kvcrush_config(kvcrush_config),
11+
m_block_size(block_size),
12+
rng(std::mt19937(kvcrush_config.rng_seed)) {}
13+
14+
// step 1: create_indicators_kvcrush()
15+
std::vector<size_t> KVCrushAlgorithm::create_indicators_kvcrush(size_t num_tokens_in_evictable_blocks,
16+
17+
std::vector<size_t>& evicted_block_indices,
18+
const std::vector<double>& layer_scores) {
19+
// Step 1: Rank the token indices by score so that the top-scoring half can be marked as important
20+
const auto& blocks_eligible_for_kvcrush = evicted_block_indices;
21+
std::vector<size_t> indices(num_tokens_in_evictable_blocks);
22+
std::iota(indices.begin(), indices.end(), 0);
23+
std::partial_sort(indices.begin(),
24+
indices.begin() + num_tokens_in_evictable_blocks / 2,
25+
indices.end(),
26+
[&](size_t i, size_t j) {
27+
return layer_scores[i] > layer_scores[j];
28+
});
29+
30+
std::vector<size_t> indicators(num_tokens_in_evictable_blocks, 0);
31+
for (size_t i = 0; i < num_tokens_in_evictable_blocks / 2; ++i) {
32+
indicators[indices[i]] = 1;
33+
}
34+
return indicators;
35+
}
36+
// step 2: create_anchor_point_kvcrush()
37+
std::vector<size_t> KVCrushAlgorithm::create_anchor_point_kvcrush(size_t num_tokens_in_evictable_blocks,
38+
39+
std::vector<size_t>& indicators) {
40+
// Step 2: Create a binary vector of size block_size as anchor point
41+
std::vector<size_t> anchor_point(m_block_size);
42+
// Initialize anchor_point according to the configured anchor point mode
43+
switch (m_kvcrush_config.anchor_point_mode) {
44+
case KVCrushAnchorPointMode::RANDOM: {
45+
std::uniform_int_distribution<int> dist(0, 1);
46+
std::generate(anchor_point.begin(), anchor_point.end(), [&]() {
47+
return dist(rng);
48+
});
49+
} break;
50+
case KVCrushAnchorPointMode::ZEROS:
51+
std::fill(anchor_point.begin(), anchor_point.end(), 0);
52+
break;
53+
case KVCrushAnchorPointMode::ONES:
54+
std::fill(anchor_point.begin(), anchor_point.end(), 1);
55+
break;
56+
case KVCrushAnchorPointMode::MEAN: {
57+
size_t num_blocks = num_tokens_in_evictable_blocks / m_block_size;
58+
for (size_t pos = 0; pos < m_block_size; pos++) {
59+
// Calculate sum of indicators at this position across all blocks
60+
size_t sum = 0;
61+
for (size_t block_idx = 0; block_idx < num_blocks; block_idx++) {
62+
size_t idx = block_idx * m_block_size + pos;
63+
sum += indicators[idx];
64+
}
65+
66+
// Calculate mean and set anchor point based on threshold (0.5)
67+
double mean = static_cast<double>(sum) / num_blocks;
68+
anchor_point[pos] = (mean > 0.5) ? 1 : 0;
69+
}
70+
break;
71+
}
72+
case KVCrushAnchorPointMode::ALTERNATE:
73+
for (size_t i = 0; i < m_block_size; ++i) {
74+
anchor_point[i] = i % 2;
75+
}
76+
break;
77+
default:
78+
OPENVINO_THROW("Invalid anchor point type");
79+
}
80+
return anchor_point;
81+
}
82+
83+
// step 3: calculate_hamming_distance()
84+
std::vector<std::pair<size_t, size_t>> KVCrushAlgorithm::calculate_hamming_distance_kvcrush(
85+
size_t num_tokens_in_evictable_blocks,
86+
87+
std::vector<size_t>& indicators,
88+
std::vector<size_t>& anchor_point) {
89+
// Step 3: Calculate Hamming distances between anchor point and each block
90+
size_t num_blocks = num_tokens_in_evictable_blocks / m_block_size;
91+
std::vector<std::pair<size_t, size_t>> block_distances; // pair<hamming_distance, block_idx>
92+
block_distances.reserve(num_blocks);
93+
94+
for (size_t block_idx = 0; block_idx < num_blocks; ++block_idx) {
95+
size_t hamming_distance = 0;
96+
for (size_t j = 0; j < m_block_size; ++j) {
97+
size_t token_idx = block_idx * m_block_size + j;
98+
if (token_idx < num_tokens_in_evictable_blocks) {
99+
// Use the indicators vector to determine the bit value of this position
100+
int bit_value = indicators[token_idx];
101+
if (bit_value != anchor_point[j]) {
102+
hamming_distance++;
103+
}
104+
}
105+
}
106+
block_distances.emplace_back(hamming_distance, block_idx);
107+
}
108+
return block_distances;
109+
}
110+
111+
// step 4: get_representative_blocks()
112+
std::vector<std::size_t> KVCrushAlgorithm::get_representative_blocks_kvcrush(
113+
114+
size_t num_tokens_in_evictable_blocks,
115+
std::vector<std::pair<size_t, size_t>>& block_distances,
116+
const std::vector<size_t>& blocks_eligible_for_kvcrush) {
117+
// Step 4: Find the representative blocks
118+
// Filter block indices that are in blocks_eligible_for_kvcrush
119+
std::vector<size_t> filtered_block_indices;
120+
filtered_block_indices.reserve(block_distances.size());
121+
122+
for (const auto& entry : block_distances) {
123+
size_t block_idx = entry.second;
124+
// Check if block_idx is in blocks_eligible_for_kvcrush
125+
if (std::find(blocks_eligible_for_kvcrush.begin(), blocks_eligible_for_kvcrush.end(), block_idx) !=
126+
blocks_eligible_for_kvcrush.end()) {
127+
filtered_block_indices.push_back(block_idx);
128+
}
129+
}
130+
// Sort filtered_block_indices based on Hamming distance
131+
std::sort(filtered_block_indices.begin(), filtered_block_indices.end(), [&](size_t a, size_t b) {
132+
return block_distances[a].first < block_distances[b].first;
133+
});
134+
// select kvcrush_budget number of blocks from filtered_block_indices, uniformly spaced
135+
size_t num_blocks_to_retain = std::min(filtered_block_indices.size(), m_kvcrush_config.get_budget());
136+
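if (num_blocks_to_retain == 0) {
return {};  // nothing to retain (no eligible blocks or zero budget); also avoids dividing by zero below
}
size_t step = filtered_block_indices.size() / num_blocks_to_retain;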
137+
std::vector<std::size_t> kvcrush_retained_block_indices;
138+
kvcrush_retained_block_indices.reserve(num_blocks_to_retain);
139+
for (size_t i = 0; i < num_blocks_to_retain; ++i) {
140+
size_t idx = i * step;
141+
if (idx < filtered_block_indices.size()) {
142+
kvcrush_retained_block_indices.push_back(filtered_block_indices[idx]);
143+
}
144+
}
145+
146+
return kvcrush_retained_block_indices;
147+
}
148+
149+
std::vector<std::size_t> KVCrushAlgorithm::get_indices_of_blocks_to_retain_using_kvcrush(
150+
151+
size_t num_tokens_in_evictable_blocks,
152+
std::vector<std::size_t>& evicted_block_indices,
153+
const std::vector<double>& layer_scores) {
154+
// Step 1: Create binary indicators that mark the tokens whose scores fall in the top half
155+
const auto& blocks_eligible_for_kvcrush = evicted_block_indices; // only the blocks that are evicted by the score
156+
// based eviction are eligible for kvcrush
157+
158+
std::vector<size_t> indicators =
159+
create_indicators_kvcrush(num_tokens_in_evictable_blocks, evicted_block_indices, layer_scores);
160+
161+
// Step 2: Create anchor_point based on the selected anchor point type
162+
std::vector<size_t> anchor_point = create_anchor_point_kvcrush(num_tokens_in_evictable_blocks, indicators);
163+
164+
// Step 3: Calculate Hamming distances between anchor point and each block, where each block is represented by
165+
// its binary feature vector called indicators
166+
std::vector<std::pair<size_t, size_t>> block_distances =
167+
calculate_hamming_distance_kvcrush(num_tokens_in_evictable_blocks, indicators, anchor_point);
168+
169+
// Step 4: Find the representative blocks
170+
// Filter block indices that are in blocks_eligible_for_kvcrush
171+
return get_representative_blocks_kvcrush(num_tokens_in_evictable_blocks,
172+
block_distances,
173+
blocks_eligible_for_kvcrush);
174+
}
175+
176+
} // namespace ov::genai
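
The representative-selection step (`get_representative_blocks_kvcrush`) can be sketched in Python as below. This is a sketch under stated assumptions rather than a port: the name `select_representative_blocks` is hypothetical, `distances[b]` is assumed to hold the Hamming distance of block `b`, and Python's sort is stable whereas `std::sort` is not, so blocks with equal distances may be ordered differently in the C++ code. The sketch returns early when nothing can be retained, which also avoids a division by zero when computing the step.

```python
def select_representative_blocks(distances, eligible_blocks, budget):
    """Pick up to `budget` eligible blocks, uniformly spaced over the distance-sorted list."""
    eligible = set(eligible_blocks)
    # Keep only the blocks that score-based eviction has marked for removal.
    candidates = [b for b in range(len(distances)) if b in eligible]
    # Order the candidates by their Hamming distance to the anchor point.
    candidates.sort(key=lambda b: distances[b])
    num_to_retain = min(len(candidates), budget)
    if num_to_retain == 0:
        return []  # nothing eligible or zero budget
    step = len(candidates) // num_to_retain
    # Take every `step`-th candidate so the picks spread across the distance range.
    return [candidates[i * step] for i in range(num_to_retain)]
```

With distances `[3, 0, 2, 4, 1, 2]`, eligible blocks `[0, 1, 3, 4, 5]`, and a budget of 2, the candidates sorted by distance are `[1, 4, 5, 0, 3]`, the step is 2, and blocks `[1, 5]` are retained. Spacing the picks uniformly, rather than taking the closest blocks, is what lets the retained blocks cover different distance ranges.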
