
Conversation

malcolmgreaves
Contributor

Changes Made

Adds a new daft.pydantic_integration module with a single function, daft_datatype_for. This function accepts a type and generates the appropriate daft.DataType for it. It is intended to be used in Python UDFs as return_dtype=daft_datatype_for(...).

In addition to the basic types (str, int, etc.) and container types (list, dict), this function knows how to handle Pydantic BaseModel classes as well as classes decorated with @dataclass. It can also handle such classes whose generic parameters have been filled in.
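For example, a quick sketch of the intended usage (the model and UDF body here are illustrative, not from the PR):

from pydantic import BaseModel

import daft
from daft.pydantic_integration import daft_datatype_for


class Invoice(BaseModel):
    vendor: str
    total: float
    line_items: list[str]


# daft_datatype_for(Invoice) should produce the equivalent struct DataType,
# so the UDF's output schema mirrors the Pydantic model.
@daft.udf(return_dtype=daft_datatype_for(Invoice))
def parse_invoice(texts: daft.Series) -> list[dict]:
    return [{"vendor": t, "total": 0.0, "line_items": []} for t in texts.to_pylist()]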

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions bot added the feat label on Sep 5, 2025
@malcolmgreaves
Contributor Author

There should be some good documentation accompanying this new feature. Where's a good place to put this? @kevinzwang @desmondcheongzx

With this PR, I'll be able to update the document processing notebook example to import and use it: https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/tutorials/document_processing/document_processing_tutorial.ipynb

greptile-apps[bot]

This comment was marked as outdated.

@malcolmgreaves
Contributor Author

@greptileai


@greptile-apps bot left a comment


Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The recent changes have addressed the critical function-name typo that was affecting the module's functionality: daft_dataype_for has been corrected to daft_datatype_for throughout both the implementation file and the test file. This was a significant fix, as the typo would have prevented the function from being properly exported and used.

The PR introduces a new daft.pydantic_integration module that provides automatic type conversion from Python types (including Pydantic BaseModel classes and dataclasses) to Daft DataType objects. This integration enables users to write return_dtype=daft_datatype_for(SomeModel) in their Python UDFs, which is particularly valuable for structured data processing workflows that want to combine Pydantic's validation capabilities with Daft's data processing features.

The implementation handles various type scenarios including built-in types (str, int, float), container types (list, dict), Optional/Union types, and complex structured types with generics. For Pydantic models and dataclasses, it uses introspection to analyze the fields and their type annotations, recursively building the appropriate Daft struct types. The function integrates with the existing Daft type system and follows the framework's patterns for handling complex data structures.
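A condensed sketch of that recursive mapping (illustrative only, not the PR's actual code; assumes Pydantic v2's model_fields API):

import dataclasses
import typing
from typing import Any

from pydantic import BaseModel

from daft import DataType


def to_dtype(t: type[Any]) -> DataType:
    # Illustrative subset: built-ins, containers, and structured types.
    origin, args = typing.get_origin(t), typing.get_args(t)
    if t is str:
        return DataType.string()
    if t is int:
        return DataType.int64()
    if t is float:
        return DataType.float64()
    if origin is list:
        return DataType.list(to_dtype(args[0]))
    if origin is dict:
        return DataType.map(to_dtype(args[0]), to_dtype(args[1]))
    if isinstance(t, type) and issubclass(t, BaseModel):
        # Recurse into the model's fields to build a struct type.
        return DataType.struct({name: to_dtype(f.annotation) for name, f in t.model_fields.items()})
    if dataclasses.is_dataclass(t):
        hints = typing.get_type_hints(t)
        return DataType.struct({f.name: to_dtype(hints[f.name]) for f in dataclasses.fields(t)})
    raise TypeError(f"unsupported type: {t!r}")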

Confidence score: 3/5

  • This PR has resolved the critical naming issue but still contains several logic problems that could affect functionality
  • Score reflects that while the typos are fixed, the underlying issues with nullable handling and datetime mapping remain unresolved
  • Pay close attention to the Optional/Union handling logic and datetime type mapping in the main implementation file

2 files reviewed, no comments


@malcolmgreaves
Contributor Author

And maybe there should be some integration tests here? 🤔 E.g., make a UDF using a Pydantic class, use this function for the return_dtype, run a small example through it, then verify it works as expected...
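Something like the following, perhaps (a rough sketch; the names and assertions are illustrative, and it assumes a struct-dtype UDF can return plain dicts shaped like the struct):

from pydantic import BaseModel

import daft
from daft import DataType
from daft.pydantic_integration import daft_datatype_for


class Person(BaseModel):
    name: str
    age: int


@daft.udf(return_dtype=daft_datatype_for(Person))
def to_person(names: daft.Series) -> list[dict]:
    return [{"name": n, "age": len(n)} for n in names.to_pylist()]


def test_pydantic_return_dtype() -> None:
    df = daft.from_pydict({"name": ["alice", "bob"]})
    df = df.with_column("person", to_person(df["name"]))
    # The column's dtype should be a struct mirroring Person's fields.
    assert df.schema()["person"].dtype == DataType.struct(
        {"name": DataType.string(), "age": DataType.int64()}
    )
    assert df.to_pydict()["person"][0] == {"name": "alice", "age": 5}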

@malcolmgreaves
Contributor Author

Also, I wasn't sure where to put this function. I figured its own module is good. Thoughts?

@kevinzwang
Member

I might've missed it, but does this PR add the functionality to actually turn a Pydantic BaseModel instance into Daft data, or just the conversion of a BaseModel type into a Daft DataType? In other words, can you, say, return a BaseModel in a UDF? Otherwise, it's not particularly useful to pass it into return_dtype.

Also for the daft_datatype_for function, I believe if you implement it on DataType._infer_type, it should automatically work in return_dtype.

@malcolmgreaves
Contributor Author

malcolmgreaves commented Sep 5, 2025 via email

raise e


def _daft_datatype_for(f_type: type[Any]) -> DataType:
Contributor


A lot of this looks duplicated from _infer_dtype. Is there a way we could consolidate some of this logic?

Contributor Author

@malcolmgreaves Sep 5, 2025


I think that's doable. But how would it work with the optional pydantic import? I could put in some import-checking code to skip over the pydantic-related parts if pydantic isn't installed.

Also, that's a private function; is there a better place to put a public function? The intention of this PR is to make this functionality public to users. And IIUC, users aren't supposed to call _infer_dtype, right? So how could they use this for their return_dtype=?

Contributor


we have conditional checks in there as well, such as pil.Image.

        elif (
            isinstance(user_provided_type, types.ModuleType)
            and hasattr(user_provided_type.module, "__name__")
            and user_provided_type.module.__name__ == "PIL.Image"
        ) or (pil_image.module_available() and user_provided_type is pil_image.Image):
            return DataType.image()

So I think it would be reasonable to add conditional checks in here for pydantic types.
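For pydantic, that conditional check could look something like this (a sketch; assumes pydantic remains an optional dependency):

try:
    from pydantic import BaseModel

    _PYDANTIC_AVAILABLE = True
except ImportError:
    _PYDANTIC_AVAILABLE = False


def _is_pydantic_model(t: object) -> bool:
    # Only attempt Pydantic handling when the package is importable.
    return _PYDANTIC_AVAILABLE and isinstance(t, type) and issubclass(t, BaseModel)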

As far as it being private goes, we do implicit conversions, so typically you can use anything that works with _infer_dtype without having to explicitly call it.

As such, the user shouldn't even need to manually perform the conversion -- they can just use their model directly.

class PydanticModel(...):
    pass

@daft.func
def my_func(text: str) -> PydanticModel:
    pass

# alternatively
@daft.func(return_dtype=PydanticModel)
def my_func(text):
    pass

Contributor Author


Ah ok, thanks for pointing this mechanism out! I did not know this is how it worked. I can work on adapting this PR to fit into _infer_type.

__all__: Sequence[str] = ("daft_datatype_for",)


def daft_datatype_for(f_type: type[Any]) -> DataType:
Contributor


Typically, most of our conversion methods are prefixed with from_. I'd prefer to keep this consistent with our other datatype conversion functions.

dt = DataType.from_arrow_type(arrow_dtype)
dt = DataType.from_numpy_dtype(np_type)
dt = DataType.from_pydantic(...)

Contributor Author

@malcolmgreaves Sep 5, 2025


Would that pattern work for what I've implemented here?

Specifically, what I made here means the user doesn't need to say upfront "this is just for a pydantic data type" or "a list of pydantic datatypes", etc. It looks at the type and then recursively descends its structure to determine what the daft.DataType is. That's why the signature is type[Any] -> daft.DataType. It's also necessary for the input to be type[Any] because the function is recursive: we're not guaranteed to only keep getting pydantic BaseModels as input as we descend into the structure.

IIUC, having a DataType.from_pydantic wouldn't work for something that (see the sketch at the end of this comment):

  • is a list[<pydantic basemodel>]
  • is a dict[<something>, <pydantic basemodel>]
  • is a pydantic basemodel that has attributes that are @dataclass decorated classes

Or, is there something I'm missing from your suggestion here?
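For concreteness, these are the kinds of calls I mean (hypothetical usage of this PR's function):

from dataclasses import dataclass

from pydantic import BaseModel

from daft.pydantic_integration import daft_datatype_for


@dataclass
class Address:
    street: str
    zip_code: str


class Customer(BaseModel):
    name: str
    address: Address  # a @dataclass attribute nested inside a BaseModel


dt_list = daft_datatype_for(list[Customer])      # list of models
dt_map = daft_datatype_for(dict[str, Customer])  # dict with model values
dt_struct = daft_datatype_for(Customer)          # nested dataclass handled recursively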

Contributor


It looks at the type and then recursively descends its structure to determine what the daft.DataType is.

this is exactly what the current _infer_dtype does.

Contributor Author

@malcolmgreaves Sep 5, 2025


Ok, I didn't know that, thanks! But that's not accessible to users, so we need a new publicly available function for them to use. Do you have a suggestion on where that should go?

Contributor


For the most part, I expect users to just use their native types without having to convert to a DataType, as mentioned here: #5148 (comment).

But for cases where they want to (for whatever reason) convert to a datatype manually, I'd suggest just making _infer_dtype public.

Member


Personally, I'd like to eventually have a public method for inferring the dtype so users can check the conversion logic without having to construct a UDF or whatnot. But that would mean that we have stabilized/unified our conversion logic from Python to Daft, which we have not yet done.

@malcolmgreaves force-pushed the mg/pydantic_daft_datatype branch from 7e0ed76 to 946daa8 on September 5, 2025 at 20:42