Extension metadata dropped from literals in SQL VALUES clause #17425

@paleolimbot

Describe the bug

Function calls that return scalars can be used in a SQL VALUES clause; however, if the returned field carries extension metadata, that metadata is dropped from the schema of the result.

To Reproduce

Output:

Regular select:
Field { name: "extension", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {"ARROW:extension:metadata": "foofy.foofy"} }


VALUES select:
Field { name: "extension", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

Code:

use std::collections::HashMap;

use datafusion::{
    arrow::datatypes::DataType,
    logical_expr::{ScalarUDFImpl, Signature, Volatility},
    prelude::*,
};

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    ctx.register_udf(MakeExtension::default().into());

    let batches = ctx
        .sql("SELECT make_extension('foofy zero') as extension")
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();
    println!("Regular select:");
    println!("{:?}", batches[0].schema().field(0));

    let batches = ctx
        .sql(
            "
SELECT extension FROM (VALUES
    ('one', make_extension('foofy one')),
    ('two', make_extension('foofy two')),
    ('three', make_extension('foofy three')))
AS t(string, extension)
        ",
        )
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();

    println!("\nVALUES select:");
    println!("{:?}", batches[0].schema().field(0));
}

#[derive(Debug)]
struct MakeExtension {
    signature: Signature,
}

impl Default for MakeExtension {
    fn default() -> Self {
        Self {
            signature: Signature::user_defined(Volatility::Immutable),
        }
    }
}

impl ScalarUDFImpl for MakeExtension {
    fn as_any(&self) -> &dyn std::any::Any {
        self
    }

    fn name(&self) -> &str {
        "make_extension"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn coerce_types(&self, arg_types: &[DataType]) -> datafusion::error::Result<Vec<DataType>> {
        Ok(arg_types.to_vec())
    }

    fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
        unreachable!("This shouldn't have been called")
    }

    fn return_field_from_args(
        &self,
        args: datafusion::logical_expr::ReturnFieldArgs,
    ) -> datafusion::error::Result<datafusion::arrow::datatypes::FieldRef> {
        Ok(args.arg_fields[0]
            .as_ref()
            .clone()
            .with_metadata(HashMap::from([(
                "ARROW:extension:metadata".to_string(),
                "foofy.foofy".to_string(),
            )]))
            .into())
    }

    fn invoke_with_args(
        &self,
        args: datafusion::logical_expr::ScalarFunctionArgs,
    ) -> datafusion::error::Result<datafusion::logical_expr::ColumnarValue> {
        Ok(args.args[0].clone())
    }
}

Expected behavior

I would have expected the field metadata (if identical for all items) to be propagated to the schema of the VALUES expression. This does bring in the complexity of type equality, but byte-for-byte equality of the metadata hash maps should be safe. A "user defined extension type" (if there ever is one) could define a more lenient equality checker (e.g., JSON object metadata equality for extension types whose serialization is JSON).
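
To make the proposed rule concrete, here is a minimal sketch of the "propagate only if identical" check using plain `std` types. The helper name `common_metadata` is hypothetical and is not an existing DataFusion API; it only illustrates the byte-for-byte hash map comparison described above, with an empty map as the assumed fallback when entries disagree.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a DataFusion API): returns the metadata shared by
/// every entry of one VALUES column, or an empty map if any entry differs.
fn common_metadata(column: &[HashMap<String, String>]) -> HashMap<String, String> {
    match column.split_first() {
        // Byte-for-byte comparison: HashMap equality requires both maps to
        // hold exactly the same key/value strings.
        Some((first, rest)) if rest.iter().all(|m| m == first) => first.clone(),
        // No rows, or at least one mismatch: fall back to no metadata.
        _ => HashMap::new(),
    }
}
```

With this rule, the three `make_extension(...)` rows in the reproducer would all carry the same map, so the `extension` column would keep its metadata; mixing a plain string literal into that column would yield an empty map instead.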

Additional context

cc @timsaucer (😬 ...I can help with these, I'm just in the process of sorting through test failures and want to make sure anything we find is reported!)
