Support join cardinality estimation less conservatively #17476
Currently we require max and min to be set, as they might be used to estimate the distinct count. This is unnecessarily conservative if distinct_count has actually been provided, in which case max and min won't be used at all and the presence of max or min has no influence over how good of an estimate it is.
Good finding. Once the comments are resolved, I think it's good to go.
Co-authored-by: Piotr Findeisen <[email protected]>
(10, Absent, Absent, Inexact(3), Absent),
(10, Absent, Absent, Inexact(3), Absent),
None,
Some(Inexact(33)),
The behaviour of these three cases has changed because we now use the distinct counts even when no range is available.
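To make the changed expectation concrete, here is a minimal sketch of how the estimate in the test case above can be reproduced. It assumes the usual inner-join formula, left rows × right rows ÷ max(left distinct count, right distinct count), and uses a hypothetical `ColumnStats` struct and `estimate_inner_join_rows` function that are simplified stand-ins, not the actual DataFusion types or API.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
/// `None` plays the role of `Absent` in the test case.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    distinct_count: Option<u64>,
}

/// Assumed inner-join estimate: |L| * |R| / max(ndv_L, ndv_R).
/// Note that `min` and `max` are never consulted when both
/// distinct counts are present.
fn estimate_inner_join_rows(
    left_rows: u64,
    left: ColumnStats,
    right_rows: u64,
    right: ColumnStats,
) -> Option<u64> {
    let left_ndv = left.distinct_count?;
    let right_ndv = right.distinct_count?;
    let max_ndv = left_ndv.max(right_ndv);
    if max_ndv == 0 {
        // No distinct values means no usable selectivity.
        return None;
    }
    // Integer division, so 100 / 3 rounds down to 33.
    Some(left_rows * right_rows / max_ndv)
}
```

With 10 rows on each side, no min or max, and a distinct count of 3 on both join columns, the sketch yields 10 × 10 ÷ 3 = 33, which matches the new `Some(Inexact(33))` expectation.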
Should be OK now. Another thing I noticed: if you don't provide a distinct count, we require you to provide a range. However, we accept even an unbounded range, which carries essentially no information; in that case we basically just use the row count as our guess. We also allow string ranges, which have no cardinality info. Is this intentional? It seems to me that we should either a) not allow unbounded ranges, and in fact only allow numeric ranges (so that cardinality info is available), or b) not require a range at all.
That's a good observation.
So maybe we remove the condition after all?
I'd be inclined to remove the check entirely and always use the row count as the default guess for cardinality. If you agree, I will make this change.
I definitely agree with the alternative
Since we're debating a change that will do (b) and thus make the code more coherent, I agree to proceed with that.
@alamb @ozankabak PTAL
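Option (b) as discussed above can be sketched as a three-step fallback for the distinct count: use the provided value if there is one, otherwise derive a bound from a numeric range, otherwise fall back to the row count. The function name `max_distinct_count_sketch` and its signature are hypothetical and chosen only for illustration; the real implementation works on DataFusion's statistics types and may differ in details such as null handling.

```rust
/// Hypothetical sketch of the fallback order for a column's distinct count.
/// `range` is `Some((min, max))` only for a bounded numeric range.
fn max_distinct_count_sketch(
    num_rows: u64,
    distinct_count: Option<u64>,
    range: Option<(i64, i64)>,
) -> u64 {
    // 1. A provided distinct count wins; the range is not consulted.
    if let Some(ndv) = distinct_count {
        return ndv;
    }
    // 2. A bounded integer range holds at most (max - min + 1) values,
    //    and never more values than there are rows.
    if let Some((min, max)) = range {
        if max >= min {
            // Widen to i128 so the subtraction cannot overflow.
            let width = (max as i128 - min as i128 + 1) as u128;
            return width.min(num_rows as u128) as u64;
        }
    }
    // 3. No usable information: assume every row is distinct.
    num_rows
}
```

Under this sketch, an unbounded or non-numeric range is simply modelled as `None` and ends up in step 3, which is the same guess the existing code was already making for unbounded ranges.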
Maybe related to |
Thanks @jackkleeman @findepi and @xudong963!
The goal of this PR is to allow cardinality statistics to be passed through joins even if fields don't have max and min values set, as long as a distinct value estimate is provided.