Skip to content

Conversation

cklin
Copy link
Contributor

@cklin cklin commented Sep 4, 2025

This PR adds automation ID to the overlay-base database cache key so that we properly distinguish different analyses in the same repo for the same language.

Since I am changing the cache key format, I also moved the CodeQL bundle version to the end of the cache restore key, in case we want to remove it from the restore key sometime in the future.

Note that I chose to leave CACHE_VERSION unchanged because the old and the new cache keys are sufficiently different that there should be no risk of confusion.

Changes in this PR has been validated in an internal test repository.

Risk assessment

For internal use only. Please select the risk level of this change:

  • Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Consider adding a changelog entry for this change.
  • Confirm the readme and docs have been updated if necessary.

This commit adds automation ID to the overlay-base database cache key so
that we properly distinguish different analyses in the same repo for the
same language.

Since I am changing the cache key format, I also moved the CodeQL bundle
version to the end of the cache restore key, in case we want to remove
it from the restore key sometime in the future.

Note that I chose to leave CACHE_VERSION unchanged because the old and
the new cache keys are sufficiently different that there should be no
risk of confusion.
@cklin cklin marked this pull request as ready for review September 4, 2025 21:52
@cklin cklin requested a review from a team as a code owner September 4, 2025 21:52
@Copilot Copilot AI review requested due to automatic review settings September 4, 2025 21:52
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the overlay-base database caching mechanism by adding automation ID to the cache key to properly distinguish different analyses in the same repository for the same language. The change restructures the cache key format and includes hashing of additional components while moving the CodeQL bundle version to the end for future flexibility.

  • Adds automation ID to cache key components for better analysis differentiation
  • Introduces a hashing mechanism for cache key components to maintain manageable key length
  • Restructures the cache key format and makes getCacheRestoreKey async

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

File Description
src/overlay-database-utils.ts Implements the core changes: adds automation ID to cache key, converts getCacheRestoreKey to async, and introduces component hashing
src/overlay-database-utils.test.ts Updates tests to mock getAutomationID function for the new async cache key generation
lib/init-action.js Generated JavaScript compilation output reflecting the TypeScript changes
lib/analyze-action.js Generated JavaScript compilation output reflecting the TypeScript changes

@cklin cklin requested a review from mbg September 4, 2025 22:06
mbg
mbg previously approved these changes Sep 5, 2025
Copy link
Member

@mbg mbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you for proactively tackling this problem! I agree with your reasoning for keeping the cache version the same as well.

I only have a few minor, non-blocking comments.

function createCacheKeyHash(components: Record<string, any>): string {
const componentsJson = JSON.stringify(
components,
Object.keys(components).sort(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: This makes sense, but it's not very obvious from the documentation for JSON.stringify that the keys in the first argument will be stringified in the order given by the second argument, so I had to convince myself that this works as intended by actually trying it. Perhaps worth a small note here or a quick unit test that throws some permutations of an object at this function and checks that the resulting hash is the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question. I read the JSON.stringify() documentation, and you are right that there is no mentioning of the replacer array (2nd argument here) affecting the ordering of object properties.

What I found was a promise that object property ordering is already guaranteed to be stable without the replacer array. So I copied that documentation into a code comment and removed the replacer arugment.

Comment on lines +432 to +434
// Technically we can also include languages and codeQlVersion in the
// componentsHash, but including them explicitly in the cache key makes it
// easier to debug and understand the cache key structure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I was initially going to ask whether languages should be hashed too (since it can potentially be a long list), but then I saw your comment here, which makes sense.

For completeness, another option to consider here might be to only hash languages if there is more than one. That keeps the benefit of making it easier to understand for the "usual" case, while avoiding overly long keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion!

When I looked into it a while back, IIRC most analyses list 1 (most common) or 2 (much less common) languages. A handful have 3, and none have more. So I think I will keep things the way they are right now.

const componentsHash = createCacheKeyHash(cacheKeyComponents);

// For a cached overlay-base database to be considered compatible for overlay
// analysis, all components in the cache restore key must match:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if we discussed the possibility of using partial restore keys in the PR that added the overlay caches (I skimmed over it again, but didn't see anything). You have probably considered it already (since there's a comment further up talking about partial restore keys), but it could be worth adding once you are happy with how stable the overlay database functionality is (e.g. to use an existing overlay database when a new CodeQL CLI version is released without dropping the CodeQL version from the cache key).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the interesting comment! We will keep that in consideration.

@@ -393,10 +395,14 @@ async function generateCacheKey(
checkoutPath: string,
): Promise<string> {
const sha = await getCommitOid(checkoutPath);
return `${getCacheRestoreKey(config, codeQlVersion)}${sha}`;
const restoreKey = await getCacheRestoreKey(config, codeQlVersion);
return `${restoreKey}${sha}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I am not sure I realised until now is that restoreCache is happy to use the primary cache key as a prefix for restoring a cache. I think I was previously under the assumption that only the partial restore keys (if any) were prefix-matched. It might be worth adding a comment for this somewhere, since it might otherwise be non-obvious how restoring the cache can work if sha is included here, but not when restoring the cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it for quite a while. What is an effective way to highlight the prefix matching used for cache restore?

I ended up making three changes:

  • Rename functions and variables to highlight the difference between "save key" and "restore key"
  • Append "Prefix" to the restore key function and variable names to highlight the fact that they are key prefixes
  • Add comment that the save key consists of the restore key prefix followed by the checkout SHA

Hopefully that will make things clearer!

async function getCacheRestoreKey(
config: Config,
codeQlVersion: string,
): Promise<string> {
// The restore key (prefix) specifies which cached overlay-base databases are
// compatible with the current analysis: the cached database must have the
// same cache version and the same CodeQL bundle version.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this comment be removed now that there's a more detailed one further down / merged with that one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that there is a lot of overlap between the two comment blocks, but they serve somewhat different purposes.

I think of this comment as mainly about "why are we returning only one key when Actions Cache supports multiple restore keys". The comment near the end of this function, in contrast, is about what goes into the restore key.

In any case, your point is well taken, and I updated the comments to remove the overlap.

This commit updates componentsJson computation to call JSON.stringify()
without the replacer array and documents why the result is stable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants