
Conversation

@CTTY (Contributor) commented Jul 15, 2025

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

Ok(schema_partner)
}

// todo generate field_pos in datafusion instead of passing to here
CTTY (Contributor Author):

I found it tricky to handle this case: the input from DataFusion won't have field IDs, so we need to assign them manually. Maybe there is a way to do name mapping here?

liurenjie1024 (Contributor):

Could you help me understand why we need to change this?

CTTY (Contributor Author):

This is a temporary hack for an issue that I don't yet know how to fix properly: the RecordBatch from DataFusion won't have PARQUET_FIELD_ID_META_KEY in its schema's metadata, causing the schema visiting to fail here.

I'm thinking maybe we can bind the schema in DataFusion via name mapping, but I haven't had the chance to explore this further.
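
For illustration, here's a minimal sketch of the manual-assignment hack (top-level fields only; the function name and the positional IDs are made up for this example, not what the PR actually does):

```rust
// Sketch only: re-attach Parquet field-id metadata to an Arrow schema coming
// out of DataFusion, using field positions as stand-in IDs. Assumes the
// arrow-schema crate; the key string matches parquet's PARQUET_FIELD_ID_META_KEY.
use arrow_schema::{Field, Schema};

fn with_positional_field_ids(schema: &Schema) -> Schema {
    const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(pos, field)| {
            let mut metadata = field.metadata().clone();
            // Positional IDs are a hack; real IDs should come from the Iceberg
            // table schema (or a name mapping).
            metadata.insert(
                PARQUET_FIELD_ID_META_KEY.to_string(),
                (pos as i32 + 1).to_string(),
            );
            field.as_ref().clone().with_metadata(metadata)
        })
        .collect();
    Schema::new(fields)
}
```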

liurenjie1024 (Contributor):

Why do we need to convert the RecordBatch's schema to an Iceberg schema?

liurenjie1024 (Contributor):

The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema.

CTTY (Contributor Author):

This method is used when ParquetWriter writes a RecordBatch. When counting NaN values, it needs to walk both the RecordBatch's schema and the Iceberg schema in a partner fashion:

.compute(self.schema.clone(), batch_c)?;

Basically the call stack is NanValueCountVisitor::compute -> visit_struct_with_partner -> ArrowArrayAccessor::field_partner -> get_field_id.
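
For reference, a minimal sketch (not the actual iceberg-rust visitor code) of the lookup at the end of that call stack, which is what fails when the Arrow field carries no field-id metadata:

```rust
// Sketch only: a stand-in for a get_field_id-style lookup. Assumes the
// arrow-schema crate; the key string matches parquet's PARQUET_FIELD_ID_META_KEY.
use arrow_schema::Field;

fn get_field_id(field: &Field) -> Result<i32, String> {
    const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";
    field
        .metadata()
        .get(PARQUET_FIELD_ID_META_KEY)
        // A DataFusion-produced RecordBatch schema usually has no such entry,
        // so this is where the visit bails out.
        .ok_or_else(|| format!("field `{}` has no field id in its metadata", field.name()))?
        .parse::<i32>()
        .map_err(|e| e.to_string())
}
```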

liurenjie1024 (Contributor):

Thanks for the explanation, that makes sense to me. We need a separate issue to solve this.

CTTY (Contributor Author):

Created an issue for this: #1560

@CTTY force-pushed the ctty/df-insert branch from 7843b0d to 2f9efa8 on July 16, 2025 03:37
@liurenjie1024 (Contributor) left a comment:

Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with an unpartitioned table first.

}

#[tokio::test]
async fn test_insert_into() -> Result<()> {
liurenjie1024 (Contributor):

I'm not a big fan of adding this kind of integration test. How about adding sqllogictests?

CTTY (Contributor Author):

Discussed offline: we can proceed with the unit test for now, since the sqllogictest testing framework for iceberg-rust is still WIP, and we can revisit this later.

Comment on lines +162 to +164
println!("----StructArray from record stream: {:?}", struct_arr);
println!("----Schema.as_struct from table: {:?}", schema.as_struct());
liurenjie1024 (Contributor):

We should use log here.
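
For example, a drop-in replacement for the two println! lines quoted above, assuming the log facade (with some logger implementation configured) is available:

```rust
// Sketch only: route the debug output through the `log` facade instead of println!.
log::debug!("StructArray from record stream: {:?}", struct_arr);
log::debug!("Schema.as_struct from table: {:?}", schema.as_struct());
```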

CTTY (Contributor Author):

This is for testing only, and I'm planning to remove these log lines

@liurenjie1024 (Contributor) left a comment:

Thanks @CTTY, the direction looks good to me!


// Verify each serialized file contains expected data
for json in &serialized_files {
    assert!(json.contains("path/to/file"));
liurenjie1024 (Contributor):

nit: Why not assert the full JSON output? We could use a snapshot test to make it easier; see https://docs.rs/expect-test/latest/expect_test/
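
For instance, a minimal expect-test sketch (the serialized_files value and the JSON literal are illustrative, not the PR's actual output):

```rust
use expect_test::expect;

#[test]
fn serialized_data_file_snapshot() {
    // Stand-in for the files serialized by the writer under test.
    let serialized_files = vec![r#"{"file_path":"path/to/file"}"#.to_string()];

    // The expected literal lives in the source; running with UPDATE_EXPECT=1
    // rewrites it in place when the output changes.
    let expected = expect![[r#"{"file_path":"path/to/file"}"#]];
    expected.assert_eq(&serialized_files[0]);
}
```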

CTTY (Contributor Author):

I think a snapshot test makes more sense

CTTY (Contributor Author):

I've created a new PR to address the DataFileSerde-related changes separately

@CTTY force-pushed the ctty/df-insert branch from 078f458 to d6e3f37 on July 29, 2025 21:18
liurenjie1024 pushed a commit that referenced this pull request Aug 8, 2025
…les (#1588)

## Which issue does this PR close?

- Closes #1546 
- Draft: #1511 

## What changes are included in this PR?
- Added `IcebergCommitExec` to help commit the data files written and
return the number of rows written


## Are these changes tested?
Added ut
liurenjie1024 pushed a commit that referenced this pull request Aug 12, 2025
…port (#1585)

## Which issue does this PR close?

- Closes #1545
- See the original draft PR: #1511 

## What changes are included in this PR?
- Added `IcebergWriteExec` to write the input execution plan to parquet
files, and returns serialized data files


## Are these changes tested?
added ut
@CTTY (Contributor Author) commented Aug 13, 2025

I'm closing this draft since it's no longer active and most of the changes have been merged

@CTTY closed this on Aug 13, 2025
@CTTY deleted the ctty/df-insert branch on August 13, 2025 00:45
Yiyang-C pushed a commit to Yiyang-C/iceberg-rust that referenced this pull request Aug 26, 2025
…les (apache#1588)

Yiyang-C pushed a commit to Yiyang-C/iceberg-rust that referenced this pull request Aug 26, 2025
…port (apache#1585)

Xuanwo pushed a commit that referenced this pull request Aug 28, 2025
…1600)

## Which issue does this PR close?

- A part of #1540 
- See draft: #1511 

## What changes are included in this PR?
- Added `catalog` to `IcebergTableProvider` as optional
- Added table refresh logic in `IcebergTableProvider::scan`
- Implement `insert_into` for `IcebergTableProvider` using write node
and commit node for non-partitioned tables

## Are these changes tested?
Added tests

---------

Co-authored-by: Renjie Liu <[email protected]>