fix(iceberg-datafusion): handle timestamp predicates from DF #1569
The test failure seems unrelated.
DataFusion sometimes passes dates as string literals, but can also pass timestamp ScalarValues, which need to be converted to predicates correctly in order to enable partition pruning.
This helps with predicate expressions such as `date > DATE_TRUNC('day', ts)`.
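As a rough illustration of the unit handling involved (not the code in this PR), the sketch below normalizes a raw timestamp value to microseconds. `TimeUnit` and `to_micros` are hypothetical stand-ins for the DataFusion `ScalarValue::Timestamp*` variants, and folding nanoseconds down to microseconds is an assumption of this sketch.

```rust
/// Hypothetical stand-in for the four timestamp precisions DataFusion can pass.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Normalizes `value` in `unit` to microseconds since the Unix epoch.
/// Returns `None` when scaling up would overflow `i64`.
fn to_micros(value: i64, unit: TimeUnit) -> Option<i64> {
    match unit {
        TimeUnit::Second => value.checked_mul(1_000_000),
        TimeUnit::Millisecond => value.checked_mul(1_000),
        TimeUnit::Microsecond => Some(value),
        // Assumption: round toward negative infinity when dropping precision.
        TimeUnit::Nanosecond => Some(value.div_euclid(1_000)),
    }
}
```

For instance, `to_micros(1672574400, TimeUnit::Second)` gives `Some(1_672_574_400_000_000)` (2023-01-01 12:00:00 UTC), while a value too large to scale gives `None`, so a caller could skip pushdown instead of pruning on a wrapped value.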
Pull Request Overview
This PR enhances DataFusion integration by adding support for timestamp ScalarValues in predicate conversion, enabling proper partition pruning for timestamp predicates. Previously, only string literals were handled correctly for date/time filtering.
Key changes:
- Added timestamp scalar value handling for Second, Millisecond, Microsecond, and Nanosecond precisions
- Enhanced type conversion logic in Iceberg's value system for better timestamp interoperability
- Comprehensive test coverage for various timestamp formats and timezone scenarios
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs | Adds timestamp ScalarValue conversion logic and comprehensive tests for timestamp predicate handling |
| crates/integrations/datafusion/Cargo.toml | Adds chrono dependency for timestamp parsing functionality |
| crates/iceberg/src/spec/values.rs | Enhances Datum type conversion with comprehensive timestamp format support and cross-conversion capabilities |
fn interpret_timestamptz_micros(micros: i64, tz: Option<impl AsRef<str>>) -> Option<Datum> {
    let offset = tz
        .as_ref()
        .and_then(|s| s.as_ref().parse::<FixedOffset>().ok());
The parse error is silently ignored with `.ok()`. Consider logging the error or providing more specific error handling to help debug timezone parsing failures.
- .and_then(|s| s.as_ref().parse::<FixedOffset>().ok());
+ .and_then(|s| match s.as_ref().parse::<FixedOffset>() {
+     Ok(offset) => Some(offset),
+     Err(e) => {
+         eprintln!("Failed to parse timezone string '{}': {}", s.as_ref(), e);
+         None
+     }
+ });
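An alternative to logging is to let the failure propagate, so the caller can decide whether to skip pushdown. The sketch below shows that shape using only the standard library. `parse_offset_seconds` and `resolve_offset` are hypothetical names, and the hand-written `+HH:MM` parser only stands in for `FixedOffset` parsing; it is not chrono's implementation.

```rust
/// Hypothetical stand-in for parsing a fixed offset such as "+05:30" into seconds.
fn parse_offset_seconds(s: &str) -> Result<i32, String> {
    let err = || format!("invalid fixed offset: {:?}", s);
    let (sign, rest) = match s.as_bytes().first() {
        Some(b'+') => (1, &s[1..]),
        Some(b'-') => (-1, &s[1..]),
        _ => return Err(err()),
    };
    // Accept exactly two ASCII digits for each component.
    let two_digits = |p: &str| -> Option<i32> {
        if p.len() == 2 && p.bytes().all(|b| b.is_ascii_digit()) {
            p.parse().ok()
        } else {
            None
        }
    };
    let (hh, mm) = rest.split_once(':').ok_or_else(|| err())?;
    let hours = two_digits(hh).ok_or_else(|| err())?;
    let minutes = two_digits(mm).ok_or_else(|| err())?;
    if hours > 23 || minutes > 59 {
        return Err(err());
    }
    Ok(sign * (hours * 3600 + minutes * 60))
}

/// `None` means "no timezone given"; `Err` means "timezone given but unparseable".
fn resolve_offset(tz: Option<&str>) -> Result<Option<i32>, String> {
    tz.map(parse_offset_seconds).transpose()
}
```

With this shape, `resolve_offset(None)` is `Ok(None)`, `resolve_offset(Some("+05:30"))` is `Ok(Some(19800))`, and a string that is not a fixed offset, such as `"America/New_York"`, is an `Err` instead of being treated as if no timezone had been supplied.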
(PrimitiveLiteral::Long(val), source_type, target_type) => {
    match (source_type, target_type) {
[nitpick] The nested match statement with extensive pattern matching creates high complexity. Consider extracting this logic into a separate helper function like `convert_long_value` to improve readability and maintainability.
@@ -1196,14 +1199,92 @@ impl Datum {
(PrimitiveLiteral::Int(val), _, PrimitiveType::Int) => Ok(Datum::int(*val)),
(PrimitiveLiteral::Int(val), _, PrimitiveType::Date) => Ok(Datum::date(*val)),
(PrimitiveLiteral::Int(val), _, PrimitiveType::Long) => Ok(Datum::long(*val)),
(PrimitiveLiteral::Int(val), PrimitiveType::Date, PrimitiveType::Timestamp) => {
    Ok(Datum::timestamp_micros(*val as i64 * MICROS_PER_DAY))
nit: Do a safe multiply here?
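To make the suggestion concrete, here is a minimal sketch of a checked multiply for the date-to-microseconds case. `date_to_micros` and the `String` error are hypothetical; the real code would presumably return the crate's own error type.

```rust
const MICROS_PER_DAY: i64 = 24 * 60 * 60 * 1_000_000;

/// Converts days since the Unix epoch to microseconds, reporting overflow
/// as an error instead of wrapping (release builds) or panicking (debug builds).
fn date_to_micros(days: i32) -> Result<i64, String> {
    i64::from(days)
        .checked_mul(MICROS_PER_DAY)
        .ok_or_else(|| format!("date {} overflows timestamp micros", days))
}
```

An `i32` day count can exceed what `i64` microseconds can hold: `date_to_micros(i32::MAX)` returns an error here, whereas the unchecked `*val as i64 * MICROS_PER_DAY` would overflow.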
PrimitiveType::Date,
PrimitiveType::TimestampNs,
) => Ok(Datum::timestamp_nanos(
    *val as i64 * MICROS_PER_DAY * NANOS_PER_MICRO,
nit: Do a safe multiply here? (Probably more critical than above, because the representable range is smaller.)
Still ramping up on the code base but left some comments.
PrimitiveLiteral::Int(val),
PrimitiveType::Date,
PrimitiveType::Timestamptz,
) => Ok(Datum::timestamptz_micros(*val as i64 * MICROS_PER_DAY)),
same comment on safe multiply.
PrimitiveType::Date,
PrimitiveType::TimestamptzNs,
) => Ok(Datum::timestamptz_nanos(
    *val as i64 * MICROS_PER_DAY * NANOS_PER_MICRO,
safe multiply ... Also, I assume the compiler would optimize this out, but consider just defining NANOS_PER_DAY and multiplying by that?
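A sketch of what that could look like, assuming the constant values shown (`date_to_nanos` is a hypothetical helper). Because `NANOS_PER_DAY` is a `const`, the product is evaluated at compile time, so nothing depends on the optimizer, and only one checked multiply remains at runtime.

```rust
const MICROS_PER_DAY: i64 = 86_400_000_000;
const NANOS_PER_MICRO: i64 = 1_000;
// Evaluated at compile time; an overflow here would be a compile error.
const NANOS_PER_DAY: i64 = MICROS_PER_DAY * NANOS_PER_MICRO;

/// Converts days since the Unix epoch to nanoseconds, or `None` on overflow.
fn date_to_nanos(days: i32) -> Option<i64> {
    i64::from(days).checked_mul(NANOS_PER_DAY)
}
```

With these values the last representable day count is 106,751 (`i64::MAX / NANOS_PER_DAY`), so `date_to_nanos(106_752)` already returns `None`.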
let result = datum.to(&Primitive(PrimitiveType::Timestamp)).unwrap();

let expected = Datum::timestamp_from_datetime(
It would probably be useful to test with wider ranges (e.g. 9999-12-31 and 0001-01-01) as well.
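To see what those boundary dates imply, the sketch below computes their day counts with a standard civil-date algorithm and checks which conversions fit in `i64`. `days_from_civil` is a hypothetical helper written for this illustration, not an existing function in the crate.

```rust
const MICROS_PER_DAY: i64 = 86_400_000_000;
const NANOS_PER_DAY: i64 = MICROS_PER_DAY * 1_000;

/// Days since 1970-01-01 for a proleptic Gregorian date (month 1..=12, day 1..=31).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as the end of the previous year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = (month + 9) % 12; // March = 0
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Returns (micros, nanos) for a day count, each `None` if it overflows `i64`.
fn boundary_conversions(days: i64) -> (Option<i64>, Option<i64>) {
    (days.checked_mul(MICROS_PER_DAY), days.checked_mul(NANOS_PER_DAY))
}
```

Under these definitions 0001-01-01 is day -719,162 and 9999-12-31 is day 2,932,896. Both fit when converted to microseconds, but both overflow `i64` when converted to nanoseconds, which is where a checked multiply would matter.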
}

#[test]
fn test_datum_timestamptz_nanos_convert_to_timestamptz_micros() {
Still ramping up on the code base, and not sure if we do it elsewhere, but would it be possible to parameterize this form of test to avoid the boilerplate?
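One dependency-free way to do this is a table-driven test: a single loop over `(name, input, expected)` rows. The sketch below shows the pattern. `nanos_to_micros` is a hypothetical stand-in for the conversion under test, and its rounding toward negative infinity is an assumption, not necessarily what `Datum::to` does.

```rust
/// Hypothetical conversion under test: nanoseconds to microseconds,
/// rounding toward negative infinity (an assumption of this sketch).
fn nanos_to_micros(nanos: i64) -> i64 {
    nanos.div_euclid(1_000)
}

/// Checks every row of the table and returns how many rows were run.
fn run_cases() -> usize {
    let cases: [(&str, i64, i64); 4] = [
        ("epoch", 0, 0),
        ("exact multiple", 1_672_574_400_000_000_000, 1_672_574_400_000_000),
        ("sub-micro remainder", 1_999, 1),
        ("negative", -1, -1),
    ];
    for &(name, input, expected) in cases.iter() {
        // The case name appears in the failure message, so a failing row is easy to spot.
        assert_eq!(nanos_to_micros(input), expected, "case: {}", name);
    }
    cases.len()
}
```

In a real test module the body of `run_cases` would sit inside a single `#[test]` function, and adding a case becomes a one-line change to the table.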
fn test_predicate_conversion_with_timestamp() {
    // 2023-01-01 12:00:00 UTC
    let timestamp_scalar = ScalarValue::TimestampSecond(Some(1672574400), None);
    let dt = DateTime::parse_from_rfc3339("2023-01-01T12:00:00+00:00").unwrap();
Same comment about maybe considering parameterization of these tests.
@@ -214,18 +218,46 @@ fn scalar_value_to_datum(value: &ScalarValue) -> Option<Datum> {
ScalarValue::LargeUtf8(Some(v)) => Some(Datum::string(v.clone())),
ScalarValue::Date32(Some(v)) => Some(Datum::date(*v)),
ScalarValue::Date64(Some(v)) => Some(Datum::date((*v / MILLIS_PER_DAY) as i32)),
ScalarValue::TimestampSecond(Some(v), tz) => {
    interpret_timestamptz_micros(v * MICROS_PER_SECOND, tz.as_deref())
safe multiply?