Skip to content

Conversation

jackkleeman
Copy link
Contributor

The goal of this PR is to allow cardinality statistics being passed through joins even if fields don't have max and min values set, as long as a distinct value estimate is provided.

Currently we require max and min to be set, as they might be used to estimate the distinct count. This is unnecessarily conservative if distinct_count has actually been provided, in which case max and min won't be used at all and the presence of max or min has no influence over how good of an estimate it is.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 8, 2025
Currently we require max and min to be set, as they might be used to
estimate the distinct count. This is unnecessarily conservative if
distinct_count has actually been provided, in which case max and min
won't be used at all and the presence of max or min has no influence
over how good of an estimate it is.
@jackkleeman jackkleeman force-pushed the min-max-join-estimation branch from b8154a5 to 778a7e9 Compare September 8, 2025 15:44
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good finding, after resolving the comments, i think it's good to go

@jackkleeman jackkleeman force-pushed the min-max-join-estimation branch from 2f17bdd to 60a2edb Compare September 10, 2025 09:11
(10, Absent, Absent, Inexact(3), Absent),
(10, Absent, Absent, Inexact(3), Absent),
None,
Some(Inexact(33)),
Copy link
Contributor Author

@jackkleeman jackkleeman Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the behaviour of these three have changed because we will use the distinct values even though we have no range

@jackkleeman jackkleeman force-pushed the min-max-join-estimation branch from 60a2edb to d392e81 Compare September 10, 2025 09:12
@jackkleeman
Copy link
Contributor Author

should be ok now.

another thing i noticed; if you dont provide distinct count, we require you to provide range. however, we accept even an unbounded range, which is kind of no information - in that case we just use the row count as our guess, basically. also we allow string ranges which have no cardinalirty info. is this intentional? it seems to me that we should either a) not allow unbounded range, in fact only allow ranges that are numeric (so cardinality info is available) or b) not require range at all

@findepi
Copy link
Member

findepi commented Sep 10, 2025

if you dont provide distinct count, we require you to provide range. however, we accept even an unbounded range, which is kind of no information - in that case we just use the row count as our guess, basically.

That's a good observation.
This explains test changes I got after removal of the condition block, rather than make it more complicated.

  • // Break if any of statistics bounds are undefined
    if left_stat.min_value.get_value().is_none()
    || left_stat.max_value.get_value().is_none()
    || right_stat.min_value.get_value().is_none()
    || right_stat.max_value.get_value().is_none()
    {
    return None;
    }

So maybe we remove the condition after all?

@jackkleeman
Copy link
Contributor Author

Id be inclined to remove the check entirely and always use a row count as a default guess for cardinality. If you agree I will do this change

@findepi
Copy link
Member

findepi commented Sep 11, 2025

I definitely agree with the alternative

it seems to me that we should either a) not allow unbounded range, in fact only allow ranges that are numeric (so cardinality info is available) or b) not require range at all

Since we're debating a change that will do (b) and thus make the code more coherent, I agree to proceed with that.
I am not taking a stance which one of (a) or (b) is better. We can switch from (b) to (a) later, when we known which one is better.

@jackkleeman jackkleeman changed the title Support join cardinality estimation if distinct_count is set Support join cardinality estimation less conservatively Sep 11, 2025
@jackkleeman jackkleeman requested a review from findepi September 11, 2025 14:51
@findepi
Copy link
Member

findepi commented Sep 12, 2025

@alamb @ozankabak PTAL

@alamb
Copy link
Contributor

alamb commented Sep 25, 2025

@alamb alamb added this pull request to the merge queue Sep 25, 2025
@alamb
Copy link
Contributor

alamb commented Sep 25, 2025

Thanks @jackkleeman @findepi and @xudong963 !

Merged via the queue into apache:main with commit 7f70ac6 Sep 25, 2025
28 of 29 checks passed
@jackkleeman jackkleeman deleted the min-max-join-estimation branch September 25, 2025 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants