feat(datafusion): Support insert_into in IcebergTableProvider #1511
Conversation
crates/iceberg/src/arrow/value.rs (outdated)

```rust
    Ok(schema_partner)
}

// todo generate field_pos in datafusion instead of passing to here
```
I found it tricky to handle this case: the input from DataFusion won't have field ids, and we will need to assign them manually. Maybe there is a way to do name mapping here?
Could you help me understand why we need to change this?
This is a temporary hack for an issue that I don't know exactly how to fix yet: the `RecordBatch` from DataFusion won't have `PARQUET_FIELD_ID_META_KEY` in its schema's metadata, causing the schema visiting to fail here. I'm thinking maybe we can bind the schema in DataFusion via name mapping, but I haven't had the chance to explore that further.
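For context, here is a minimal sketch of what attaching field ids to an Arrow schema could look like, assuming the `PARQUET:field_id` metadata key that the schema visitor expects and a purely positional id assignment (a real fix would more likely map names to ids against the table's Iceberg schema):

```rust
use std::collections::HashMap;

use arrow_schema::{Field, Schema};

// Metadata key the Arrow-to-Iceberg schema visitor looks up; mirrors
// PARQUET_FIELD_ID_META_KEY ("PARQUET:field_id") mentioned above.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Attach Iceberg field ids to a flat Arrow schema by position.
/// Sketch only: nested types and name mapping are not handled here.
fn with_field_ids(schema: &Schema, ids: &[i32]) -> Schema {
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .zip(ids)
        .map(|(field, id)| {
            let mut metadata: HashMap<String, String> = field.metadata().clone();
            metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
            field.as_ref().clone().with_metadata(metadata)
        })
        .collect();
    Schema::new(fields)
}
```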
Why do we need to convert the `RecordBatch`'s schema to an Iceberg schema?
The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema.
This method is used when using `ParquetWriter` to write a `RecordBatch`. When it's counting NaN values, it needs to walk through both the `RecordBatch`'s schema and the Iceberg schema in a partner fashion:

```rust
.compute(self.schema.clone(), batch_c)?;
```

Basically the call stack is `NanValueCountVisitor::compute` -> `visit_struct_with_partner` -> `ArrowArrayAccessor::field_partner` -> `get_field_id`
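To illustrate where that walk breaks down, a simplified stand-in for the field-id lookup at the bottom of that call stack (the names approximate the internals rather than copy them) might look like:

```rust
use arrow_schema::Field;

/// Simplified stand-in for the field-id lookup: it fails whenever the Arrow
/// field carries no field-id metadata, which is exactly the case for
/// RecordBatches produced by DataFusion.
fn get_field_id(field: &Field) -> Result<i32, String> {
    field
        .metadata()
        .get("PARQUET:field_id")
        .ok_or_else(|| format!("field `{}` has no PARQUET:field_id metadata", field.name()))?
        .parse::<i32>()
        .map_err(|e| format!("invalid field id for `{}`: {e}", field.name()))
}
```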
Thanks for the explanation, that makes sense to me. We need a separate issue to solve this.
Created an issue for this: #1560
Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with unpartitioned tables first.
```rust
}

#[tokio::test]
async fn test_insert_into() -> Result<()> {
```
I'm not a big fan of adding this kind of integration test. How about adding sqllogictests?
Discussed offline: we can proceed with the unit test for now since the sqllogictest testing framework for iceberg-rust is still WIP, and we can revisit this later.
println!("----StructArray from record stream: {:?}", struct_arr); | ||
println!("----Schema.as_struct from table: {:?}", schema.as_struct()); |
We should use log here.
This is for testing only, and I'm planning to remove these log lines.
crates/iceberg/src/arrow/value.rs (outdated)
Thanks @CTTY, the direction looks good to me!
```rust
// Verify each serialized file contains expected data
for json in &serialized_files {
    assert!(json.contains("path/to/file"));
```
nit: Why not assert the JSON output? We could use a snapshot test to make it easier, see https://docs.rs/expect-test/latest/expect_test/
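For illustration, a minimal expect-test sketch; the serialized JSON shown here is hypothetical, not the crate's actual `DataFile` output:

```rust
use expect_test::expect;

#[test]
fn serialized_data_file_snapshot() {
    // Stand-in for the real serialization of a DataFile; the JSON shape is
    // illustrative only.
    let json = r#"{"content":0,"file_path":"path/to/file.parquet"}"#;

    // The expected value lives inline in the source; running the test with
    // UPDATE_EXPECT=1 rewrites it in place whenever the output changes.
    let expected = expect![[r#"{"content":0,"file_path":"path/to/file.parquet"}"#]];
    expected.assert_eq(json);
}
```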
I think a snapshot test makes more sense
I've created a new PR to address the DataFileSerde-related changes separately
Co-authored-by: Renjie Liu <[email protected]>
I'm closing this draft since it's no longer active and most of the changes have been merged.
…les (apache#1588)

## Which issue does this PR close?

- Closes apache#1546
- Draft: apache#1511

## What changes are included in this PR?

- Added `IcebergCommitExec` to help commit the data files written and return the number of rows written

## Are these changes tested?

Added unit tests.
…port (apache#1585)

## Which issue does this PR close?

- Closes apache#1545
- See the original draft PR: apache#1511

## What changes are included in this PR?

- Added `IcebergWriteExec` to write the input execution plan to Parquet files and return the serialized data files

## Are these changes tested?

Added unit tests.
…1600)

## Which issue does this PR close?

- A part of #1540
- See draft: #1511

## What changes are included in this PR?

- Added `catalog` to `IcebergTableProvider` as optional
- Added table refresh logic in `IcebergTableProvider::scan`
- Implemented `insert_into` for `IcebergTableProvider` using the write node and commit node for non-partitioned tables

## Are these changes tested?

Added tests.

Co-authored-by: Renjie Liu <[email protected]>
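As a usage-level sketch (the catalog, namespace, and table names here are illustrative), once the provider implements `insert_into`, an INSERT issued through DataFusion SQL gets planned through the write and commit nodes described above:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// Assumes an IcebergTableProvider is already registered with `ctx` under the
/// illustrative name `my_catalog.my_namespace.my_table`.
async fn insert_example(ctx: &SessionContext) -> Result<()> {
    // With insert_into implemented, DataFusion routes this INSERT through the
    // provider's write node (IcebergWriteExec) and commit node (IcebergCommitExec).
    let df = ctx
        .sql("INSERT INTO my_catalog.my_namespace.my_table VALUES (1, 'a')")
        .await?;
    df.collect().await?;
    Ok(())
}
```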