feat(datafusion): Support insert_into in IcebergTableProvider #1511
Conversation
crates/iceberg/src/arrow/value.rs (outdated)

```rust
    Ok(schema_partner)
}

// todo generate field_pos in datafusion instead of passing to here
```
I found it tricky to handle this case: the input from DataFusion won't have field ids, and we will need to assign them manually. Maybe there is a way to do name mapping here?
Could you help me understand why we need to change this?
This is a temporary hack for an issue that I don't know exactly how to fix yet: the `RecordBatch` from DataFusion won't have `PARQUET_FIELD_ID_META_KEY` in its schema's metadata, causing the schema visiting to fail here. I'm thinking maybe we can bind the schema in DataFusion via name mapping, but I haven't had the chance to explore that further.
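For context, here is a minimal sketch of what attaching field ids to an Arrow schema could look like, assuming the `PARQUET:field_id` metadata key that the schema visitor expects and a purely positional id assignment (a real fix would more likely map names to ids against the table's Iceberg schema):

```rust
use std::collections::HashMap;

use arrow_schema::{Field, Schema};

// Metadata key the Arrow-to-Iceberg schema visitor looks up; mirrors
// PARQUET_FIELD_ID_META_KEY ("PARQUET:field_id") mentioned above.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Attach Iceberg field ids to a flat Arrow schema by position.
/// Sketch only: nested types and name mapping are not handled here.
fn with_field_ids(schema: &Schema, ids: &[i32]) -> Schema {
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .zip(ids)
        .map(|(field, id)| {
            let mut metadata: HashMap<String, String> = field.metadata().clone();
            metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
            field.as_ref().clone().with_metadata(metadata)
        })
        .collect();
    Schema::new(fields)
}
```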
Why do we need to convert the `RecordBatch`'s schema to an Iceberg schema?
The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema.
This method is used when using `ParquetWriter` to write a `RecordBatch`. When it's counting NaN values, it needs to walk through both the `RecordBatch`'s schema and the Iceberg schema in a partner fashion:

```rust
.compute(self.schema.clone(), batch_c)?;
```

Basically the call stack is `NanValueCountVisitor::compute` -> `visit_struct_with_partner` -> `ArrowArrayAccessor::field_partner` -> `get_field_id`
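To illustrate where that walk breaks down, a simplified stand-in for the field-id lookup at the bottom of that call stack (the names approximate the internals rather than copy them) might look like:

```rust
use arrow_schema::Field;

/// Simplified stand-in for the field-id lookup: it fails whenever the Arrow
/// field carries no field-id metadata, which is exactly the case for
/// RecordBatches produced by DataFusion.
fn get_field_id(field: &Field) -> Result<i32, String> {
    field
        .metadata()
        .get("PARQUET:field_id")
        .ok_or_else(|| format!("field `{}` has no PARQUET:field_id metadata", field.name()))?
        .parse::<i32>()
        .map_err(|e| format!("invalid field id for `{}`: {e}", field.name()))
}
```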
Thanks for the explanation, that makes sense to me. We need a separate issue to solve this.
Created an issue for this: #1560
Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with unpartitioned tables first.
```rust
}

#[tokio::test]
async fn test_insert_into() -> Result<()> {
```
I'm not a big fan of adding this kind of integration test. How about adding sqllogictests?
Discussed offline: we can proceed with the unit test for now since the sqllogictest testing framework for iceberg-rust is still WIP, and we can revisit this later.
println!("----StructArray from record stream: {:?}", struct_arr); | ||
println!("----Schema.as_struct from table: {:?}", schema.as_struct()); |
We should use log here.
This is for testing only, and I'm planning to remove these log lines.
crates/iceberg/src/arrow/value.rs (outdated)
Thanks @CTTY, the direction looks good to me!
```rust
// Verify each serialized file contains expected data
for json in &serialized_files {
    assert!(json.contains("path/to/file"));
```
nit: Why not assert the JSON output? We could use a snapshot test to make it easier, see https://docs.rs/expect-test/latest/expect_test/
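For illustration, a minimal expect-test sketch; the serialized JSON shown here is hypothetical, not the crate's actual `DataFile` output:

```rust
use expect_test::expect;

#[test]
fn serialized_data_file_snapshot() {
    // Stand-in for the real serialization of a DataFile; the JSON shape is
    // illustrative only.
    let json = r#"{"content":0,"file_path":"path/to/file.parquet"}"#;

    // The expected value lives inline in the source; running the test with
    // UPDATE_EXPECT=1 rewrites it in place whenever the output changes.
    let expected = expect![[r#"{"content":0,"file_path":"path/to/file.parquet"}"#]];
    expected.assert_eq(json);
}
```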
I think a snapshot test makes more sense
I've created a new PR to address the DataFileSerde-related changes separately
Co-authored-by: Renjie Liu <[email protected]>
I'm closing this draft since it's no longer active and most of the changes have been merged.
…les (apache#1588)

## Which issue does this PR close?

- Closes apache#1546
- Draft: apache#1511

## What changes are included in this PR?

- Added `IcebergCommitExec` to help commit the data files written and return the number of rows written

## Are these changes tested?

Added unit tests.
…port (apache#1585)

## Which issue does this PR close?

- Closes apache#1545
- See the original draft PR: apache#1511

## What changes are included in this PR?

- Added `IcebergWriteExec` to write the input execution plan to Parquet files and return the serialized data files

## Are these changes tested?

Added unit tests.
…1600)

## Which issue does this PR close?

- A part of #1540
- See draft: #1511

## What changes are included in this PR?

- Added `catalog` to `IcebergTableProvider` as optional
- Added table refresh logic in `IcebergTableProvider::scan`
- Implemented `insert_into` for `IcebergTableProvider` using the write node and commit node for non-partitioned tables

## Are these changes tested?

Added tests.

Co-authored-by: Renjie Liu <[email protected]>
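As a usage-level sketch (the catalog, namespace, and table names here are illustrative), once the provider implements `insert_into`, an INSERT issued through DataFusion SQL gets planned through the write and commit nodes described above:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Assumes an IcebergTableProvider is already registered with `ctx` under the
/// illustrative name `my_catalog.my_namespace.my_table`.
async fn insert_example(ctx: &SessionContext) -> Result<()> {
    // With insert_into implemented, DataFusion routes this INSERT through the
    // provider's write node (IcebergWriteExec) and commit node (IcebergCommitExec).
    let df = ctx
        .sql("INSERT INTO my_catalog.my_namespace.my_table VALUES (1, 'a')")
        .await?;
    df.collect().await?;
    Ok(())
}
```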