colinmarc
Contributor

DataFusion sometimes passes dates as string literals, but can also pass timestamp ScalarValues, which need to be converted to predicates correctly in order to enable partition pruning.

@colinmarc force-pushed the df-timestamps branch 3 times, most recently from 4c291c7 to 0a16745 (July 31, 2025 13:40)
@colinmarc
Contributor Author

The test failure seems unrelated.

@colinmarc force-pushed the df-timestamps branch 5 times, most recently from 7629743 to ed85a34 (August 1, 2025 21:27)
DataFusion sometimes passes dates as string literals, but can also pass
timestamp ScalarValues, which need to be converted to predicates
correctly in order to enable partition pruning.
This helps with predicate expressions such as `date > DATE_TRUNC('day',
ts)`.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances DataFusion integration by adding support for timestamp ScalarValues in predicate conversion, enabling proper partition pruning for timestamp predicates. Previously, only string literals were handled correctly for date/time filtering.

Key changes:

  • Added timestamp scalar value handling for Second, Millisecond, Microsecond, and Nanosecond precisions
  • Enhanced type conversion logic in Iceberg's value system for better timestamp interoperability
  • Comprehensive test coverage for various timestamp formats and timezone scenarios

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs | Adds timestamp ScalarValue conversion logic and comprehensive tests for timestamp predicate handling |
| crates/integrations/datafusion/Cargo.toml | Adds chrono dependency for timestamp parsing functionality |
| crates/iceberg/src/spec/values.rs | Enhances Datum type conversion with comprehensive timestamp format support and cross-conversion capabilities |

fn interpret_timestamptz_micros(micros: i64, tz: Option<impl AsRef<str>>) -> Option<Datum> {
let offset = tz
.as_ref()
.and_then(|s| s.as_ref().parse::<FixedOffset>().ok());

Copilot AI Sep 10, 2025

The parse error is silently ignored with .ok(). Consider logging the error or providing more specific error handling to help debug timezone parsing failures.

Suggested change
- .and_then(|s| s.as_ref().parse::<FixedOffset>().ok());
+ .and_then(|s| match s.as_ref().parse::<FixedOffset>() {
+     Ok(offset) => Some(offset),
+     Err(e) => {
+         eprintln!("Failed to parse timezone string '{}': {}", s.as_ref(), e);
+         None
+     }
+ });

Comment on lines +1227 to +1228
(PrimitiveLiteral::Long(val), source_type, target_type) => {
match (source_type, target_type) {

Copilot AI Sep 10, 2025

[nitpick] The nested match statement with extensive pattern matching creates high complexity. Consider extracting this logic into a separate helper function like convert_long_value to improve readability and maintainability.
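One way to read this suggestion, as a minimal sketch only: the `Ty` enum, the `String` error, and the conversion rules below are hypothetical stand-ins rather than Iceberg's `PrimitiveType`, `Datum`, or error types, and flooring on the nanosecond-to-microsecond path is an assumption.

```rust
/// Hypothetical stand-in for the subset of primitive types involved.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Long,
    Timestamp,   // microseconds since the epoch
    TimestampNs, // nanoseconds since the epoch
}

const NANOS_PER_MICRO: i64 = 1_000;

/// Keeps the (source, target) matrix in one place so the outer match
/// needs only a single arm for long-valued literals.
fn convert_long_value(val: i64, source: Ty, target: Ty) -> Result<i64, String> {
    match (source, target) {
        // Same type on both sides: nothing to do.
        (s, t) if s == t => Ok(val),
        // A plain long is reinterpreted as the target type unchanged.
        (Ty::Long, _) => Ok(val),
        // Widening the unit can overflow, so use a checked multiply.
        (Ty::Timestamp, Ty::TimestampNs) => val
            .checked_mul(NANOS_PER_MICRO)
            .ok_or_else(|| format!("{val} micros overflows i64 nanos")),
        // Narrowing the unit drops sub-microsecond precision (floor, by assumption).
        (Ty::TimestampNs, Ty::Timestamp) => Ok(val.div_euclid(NANOS_PER_MICRO)),
        (s, t) => Err(format!("cannot convert {s:?} to {t:?}")),
    }
}
```

The outer arm would then reduce to a single call to the helper, with the result wrapped in whatever `Datum` constructor matches the target type.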

@@ -1196,14 +1199,92 @@ impl Datum {
(PrimitiveLiteral::Int(val), _, PrimitiveType::Int) => Ok(Datum::int(*val)),
(PrimitiveLiteral::Int(val), _, PrimitiveType::Date) => Ok(Datum::date(*val)),
(PrimitiveLiteral::Int(val), _, PrimitiveType::Long) => Ok(Datum::long(*val)),
(PrimitiveLiteral::Int(val), PrimitiveType::Date, PrimitiveType::Timestamp) => {
Ok(Datum::timestamp_micros(*val as i64 * MICROS_PER_DAY))
Contributor

nit: Do a safe multiply here?
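A minimal sketch of what a safe multiply could look like here, assuming the conversion is allowed to fail with an error. The helper name and the `String` error are hypothetical; the real code would presumably return the crate's own error type and wrap the value in `Datum::timestamp_micros`.

```rust
/// Microseconds in one day (the diff uses a constant with the same name).
const MICROS_PER_DAY: i64 = 24 * 60 * 60 * 1_000_000;

/// Hypothetical helper: converts days since the Unix epoch to microseconds,
/// reporting overflow as an error instead of wrapping or panicking.
fn date_to_timestamp_micros(days: i32) -> Result<i64, String> {
    i64::from(days)
        .checked_mul(MICROS_PER_DAY)
        .ok_or_else(|| format!("date {days} overflows i64 microseconds"))
}
```

Overflow is reachable in principle: `i32::MAX` days times `MICROS_PER_DAY` exceeds `i64::MAX`, so an unchecked multiply can wrap in release builds.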

PrimitiveType::Date,
PrimitiveType::TimestampNs,
) => Ok(Datum::timestamp_nanos(
*val as i64 * MICROS_PER_DAY * NANOS_PER_MICRO,
Contributor

nit: Do a safe multiply here? (Probably more critical than above, because the range is smaller.)

Contributor

@emkornfield emkornfield left a comment

Still ramping up on the code base but left some comments.

PrimitiveLiteral::Int(val),
PrimitiveType::Date,
PrimitiveType::Timestamptz,
) => Ok(Datum::timestamptz_micros(*val as i64 * MICROS_PER_DAY)),
Contributor

same comment on safe multiply.

PrimitiveType::Date,
PrimitiveType::TimestamptzNs,
) => Ok(Datum::timestamptz_nanos(
*val as i64 * MICROS_PER_DAY * NANOS_PER_MICRO,
Contributor

safe multiply ... Also, I assume the compiler would optimize this out, but consider just defining NANOS_PER_DAY and multiplying by that?
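As a sketch of that idea, assuming a single `NANOS_PER_DAY` constant is acceptable: the constant expression is evaluated at compile time, so only one checked multiply remains at runtime. The helper name is hypothetical.

```rust
/// Nanoseconds in one day, folded into a single compile-time constant.
const NANOS_PER_DAY: i64 = 24 * 60 * 60 * 1_000_000_000;

/// Hypothetical helper: days since the Unix epoch to nanoseconds,
/// returning `None` when the result does not fit in an `i64`.
fn date_to_timestamp_nanos(days: i32) -> Option<i64> {
    i64::from(days).checked_mul(NANOS_PER_DAY)
}
```

With nanosecond precision an `i64` only covers about 106,751 days on either side of the epoch, so the check matters for dates that are not especially exotic.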


let result = datum.to(&Primitive(PrimitiveType::Timestamp)).unwrap();

let expected = Datum::timestamp_from_datetime(
Contributor

It would probably be useful to test with wider ranges (e.g. 9999-12-31 and 0001-01-01) as well.
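To illustrate why those boundary dates are interesting, here is a small sketch using plain integers rather than `Datum`. The helper names are hypothetical; the day counts are the offsets of 0001-01-01 and 9999-12-31 from 1970-01-01 in the proleptic Gregorian calendar.

```rust
const MICROS_PER_DAY: i64 = 86_400_000_000;
const NANOS_PER_DAY: i64 = MICROS_PER_DAY * 1_000;

/// Days since 1970-01-01 for the two boundary dates.
const DAYS_0001_01_01: i32 = -719_162;
const DAYS_9999_12_31: i32 = 2_932_896;

/// Hypothetical helpers mirroring the date-to-timestamp conversions.
fn days_to_micros(days: i32) -> Option<i64> {
    i64::from(days).checked_mul(MICROS_PER_DAY)
}

fn days_to_nanos(days: i32) -> Option<i64> {
    i64::from(days).checked_mul(NANOS_PER_DAY)
}
```

Both dates fit comfortably as microseconds, but neither fits as nanoseconds, so a wide-range test would exercise exactly the overflow path that the safe-multiply comments are concerned with.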

}

#[test]
fn test_datum_timestamptz_nanos_convert_to_timestamptz_micros() {
Contributor

Still ramping up on the code base, and not sure if we do it elsewhere, but would it be possible to parameterize this form of test to avoid the boilerplate?
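One dependency-free way to parameterize such tests is a table of cases iterated inside a single `#[test]`. The sketch below uses a hypothetical `convert` function and `Unit` enum as stand-ins for the `Datum` conversions, and assumes flooring when narrowing from nanoseconds to microseconds.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Unit {
    Micros,
    Nanos,
}

/// Hypothetical stand-in for the conversion under test.
fn convert(value: i64, from: Unit, to: Unit) -> Option<i64> {
    match (from, to) {
        (Unit::Micros, Unit::Nanos) => value.checked_mul(1_000),
        (Unit::Nanos, Unit::Micros) => Some(value.div_euclid(1_000)),
        _ => Some(value),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn timestamp_conversions() {
        // Each row: (input, source unit, target unit, expected result).
        let cases = [
            (2, Unit::Micros, Unit::Nanos, Some(2_000)),
            (1_500, Unit::Nanos, Unit::Micros, Some(1)),
            (42, Unit::Micros, Unit::Micros, Some(42)),
            (i64::MAX, Unit::Micros, Unit::Nanos, None),
        ];
        for (value, from, to, expected) in cases {
            // The message identifies the failing row.
            assert_eq!(
                convert(value, from, to),
                expected,
                "{value} {from:?} -> {to:?}"
            );
        }
    }
}
```

A single failing row stops the loop, which is the main trade-off against one test function per case; a test-case macro crate would avoid that at the cost of an extra dev-dependency.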

fn test_predicate_conversion_with_timestamp() {
// 2023-01-01 12:00:00 UTC
let timestamp_scalar = ScalarValue::TimestampSecond(Some(1672574400), None);
let dt = DateTime::parse_from_rfc3339("2023-01-01T12:00:00+00:00").unwrap();
Contributor

Same comment about maybe considering parameterization of these tests.

@@ -214,18 +218,46 @@ fn scalar_value_to_datum(value: &ScalarValue) -> Option<Datum> {
ScalarValue::LargeUtf8(Some(v)) => Some(Datum::string(v.clone())),
ScalarValue::Date32(Some(v)) => Some(Datum::date(*v)),
ScalarValue::Date64(Some(v)) => Some(Datum::date((*v / MILLIS_PER_DAY) as i32)),
ScalarValue::TimestampSecond(Some(v), tz) => {
interpret_timestamptz_micros(v * MICROS_PER_SECOND, tz.as_deref())
Contributor

safe multiply?
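Since `scalar_value_to_datum` already returns an `Option`, a checked multiply fits naturally: overflow simply becomes `None`. A minimal sketch on plain integers, with a hypothetical helper name and the `Datum` wrapping left out:

```rust
const MICROS_PER_SECOND: i64 = 1_000_000;

/// Hypothetical helper: converts an optional seconds-since-epoch value to
/// microseconds, yielding `None` for a missing value or on overflow.
fn timestamp_second_to_micros(value: Option<i64>) -> Option<i64> {
    let seconds = value?;
    seconds.checked_mul(MICROS_PER_SECOND)
}
```

In the match arm, the checked result could then be passed on to `interpret_timestamptz_micros` via `?` or `and_then`, so that an out-of-range literal yields no `Datum` rather than a wrapped value.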
