Support join cardinality estimation less conservatively #17476

jackkleeman · 2025-09-08T15:43:47Z

The goal of this PR is to allow cardinality statistics to be passed through joins even if fields don't have max and min values set, as long as a distinct value estimate is provided.

Currently we require max and min to be set, as they might be used to estimate the distinct count. This is unnecessarily conservative if distinct_count has actually been provided: in that case max and min are not used at all, and their presence has no influence on the quality of the estimate.

datafusion/physical-plan/src/joins/utils.rs

xudong963

Good finding. After resolving the comments, I think it's good to go.

Co-authored-by: Piotr Findeisen <[email protected]>

jackkleeman · 2025-09-10T09:11:39Z

datafusion/physical-plan/src/joins/utils.rs

                (10, Absent, Absent, Inexact(3), Absent),
                (10, Absent, Absent, Inexact(3), Absent),
-                None,
+                Some(Inexact(33)),


The behaviour of these three has changed because we now use the distinct values even though we have no range.
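For readers wondering where 33 comes from, here is a minimal sketch of the textbook inner-join estimate, rows_left * rows_right / max(ndv_left, ndv_right). It assumes the first element of each test tuple is the row count and the fourth is the distinct count; the function name and signature are hypothetical, not the ones in utils.rs.

```rust
/// Hypothetical helper, not DataFusion's API: the classic inner-join
/// cardinality estimate based only on row counts and distinct counts.
fn sketch_inner_join_cardinality(
    left_rows: u64,
    right_rows: u64,
    left_ndv: u64,
    right_ndv: u64,
) -> Option<u64> {
    // Use the larger distinct count as the divisor.
    let max_ndv = left_ndv.max(right_ndv);
    if max_ndv == 0 {
        // No usable distinct-count information, so give no estimate.
        return None;
    }
    // Integer division, guarding against overflow of the row product.
    left_rows.checked_mul(right_rows).map(|product| product / max_ndv)
}
```

With 10 rows and 3 distinct values on each side, this gives 10 * 10 / 3 = 33 (integer division), which matches the new expected value in the test. No min or max is involved.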

jackkleeman · 2025-09-10T09:14:24Z

should be ok now.

Another thing I noticed: if you don't provide a distinct count, we require you to provide a range. However, we accept even an unbounded range, which carries essentially no information; in that case we basically just use the row count as our guess. We also allow string ranges, which have no cardinality info. Is this intentional? It seems to me that we should either a) not allow an unbounded range, and in fact only allow numeric ranges (so cardinality info is available), or b) not require a range at all.

findepi · 2025-09-10T15:25:09Z

if you don't provide a distinct count, we require you to provide a range. However, we accept even an unbounded range, which carries essentially no information; in that case we basically just use the row count as our guess.

That's a good observation.
This explains the test changes I got when I removed the condition block instead of making it more complicated.

datafusion/datafusion/physical-plan/src/joins/utils.rs

Lines 566 to 573 in a951fc9

    
        // Break if any of statistics bounds are undefined
        if left_stat.min_value.get_value().is_none()
            || left_stat.max_value.get_value().is_none()
            || right_stat.min_value.get_value().is_none()
            || right_stat.max_value.get_value().is_none()
        {
            return None;
        }

So maybe we remove the condition after all?

jackkleeman · 2025-09-11T11:55:14Z

I'd be inclined to remove the check entirely and always use the row count as the default guess for cardinality. If you agree, I will make this change.
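To make the proposal concrete, here is a rough sketch of the fallback order being discussed: use the distinct count if provided, otherwise a bounded numeric range capped by the row count, otherwise the row count itself. `SketchColStats` and `sketch_distinct_estimate` are hypothetical names with simplified types, not DataFusion's `ColumnStatistics` or the real code in utils.rs.

```rust
/// Hypothetical, simplified column statistics for illustration only.
#[derive(Debug, Clone, Copy, Default)]
struct SketchColStats {
    min: Option<i64>,
    max: Option<i64>,
    distinct_count: Option<u64>,
}

/// Estimate the number of distinct values in a column without ever
/// bailing out: the row count is the guess of last resort.
fn sketch_distinct_estimate(num_rows: u64, stats: &SketchColStats) -> u64 {
    // 1. A provided distinct count wins; min/max are not consulted.
    if let Some(ndv) = stats.distinct_count {
        return ndv;
    }
    // 2. A bounded numeric range limits the number of possible values.
    if let (Some(min), Some(max)) = (stats.min, stats.max) {
        if max >= min {
            // Widen to i128 so the width of the full i64 range cannot overflow.
            let width = i128::from(max) - i128::from(min) + 1;
            let width = u64::try_from(width).unwrap_or(u64::MAX);
            // There cannot be more distinct values than rows.
            return width.min(num_rows);
        }
    }
    // 3. No usable information: fall back to the row count.
    num_rows
}
```

Under this sketch an absent range and an effectively unbounded range behave the same way (both end up at the row count), which is the coherence argument for dropping the check.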

findepi · 2025-09-11T14:21:18Z

I definitely agree with the alternative

It seems to me that we should either a) not allow an unbounded range, and in fact only allow numeric ranges (so cardinality info is available), or b) not require a range at all.

Since we're debating a change that will do (b) and thus make the code more coherent, I agree to proceed with that.
I am not taking a stance on which of (a) or (b) is better. We can switch from (b) to (a) later, when we know which one is better.

findepi · 2025-09-12T05:56:58Z

@alamb @ozankabak PTAL

alamb · 2025-09-25T15:35:00Z

Maybe related to

Discussion: API for Join Access Path and Join Order Selection #17718

alamb · 2025-09-25T15:35:29Z

Thanks @jackkleeman @findepi and @xudong963 !

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 8, 2025

jackkleeman force-pushed the min-max-join-estimation branch from b8154a5 to 778a7e9 Compare September 8, 2025 15:44

findepi reviewed Sep 10, 2025

xudong963 reviewed Sep 10, 2025

Update datafusion/physical-plan/src/joins/utils.rs

1d3ce8f

Co-authored-by: Piotr Findeisen <[email protected]>

jackkleeman force-pushed the min-max-join-estimation branch from 2f17bdd to 60a2edb Compare September 10, 2025 09:11

jackkleeman commented Sep 10, 2025

Update tests

d392e81

jackkleeman force-pushed the min-max-join-estimation branch from 60a2edb to d392e81 Compare September 10, 2025 09:12

Calculate cardinality even if distinct or min/max not provided

2971c85

jackkleeman changed the title ~~Support join cardinality estimation if distinct_count is set~~ Support join cardinality estimation less conservatively Sep 11, 2025

jackkleeman requested a review from findepi September 11, 2025 14:51

findepi approved these changes Sep 12, 2025

alamb added this pull request to the merge queue Sep 25, 2025

Merged via the queue into apache:main with commit 7f70ac6 Sep 25, 2025
28 of 29 checks passed

jackkleeman deleted the min-max-join-estimation branch September 25, 2025 16:10
