feat: return_dtype=daft_datatype_for(<pydantic basemodel>) #5148
Conversation
There should be some good documentation accompanying this new feature. Where's a good place to put this? @kevinzwang @desmondcheongzx With this PR, I'll be able to update the document processing notebook example to import and use it -- https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/tutorials/document_processing/document_processing_tutorial.ipynb
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The recent changes have addressed the critical function-name typo that was affecting the module's functionality: `daft_dataype_for` has been corrected to `daft_datatype_for` throughout both the implementation file and the test file. This was a significant fix, as the typo would have prevented the function from being properly exported and used.
The module introduces a new `daft.pydantic_integration` module that provides automatic type conversion from Python types (including Pydantic BaseModel classes and dataclasses) to Daft DataType objects. This integration enables users to write `return_dtype=daft_datatype_for(SomeModel)` in their Python UDFs, which is particularly valuable for structured data processing workflows where users want to combine Pydantic's validation capabilities with Daft's data processing features.
The implementation handles various type scenarios including built-in types (str, int, float), container types (list, dict), Optional/Union types, and complex structured types with generics. For Pydantic models and dataclasses, it uses introspection to analyze the fields and their type annotations, recursively building the appropriate Daft struct types. The function integrates with the existing Daft type system and follows the framework's patterns for handling complex data structures.
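To make the recursive mapping concrete, here is a minimal stand-alone sketch of what such a converter does. It is illustrative only: it returns dtype descriptions as strings rather than real `daft.DataType` objects, covers only a few of the cases the PR handles, and uses a dataclass in place of a Pydantic model (a Pydantic `BaseModel` would be walked analogously via its field annotations).

```python
import dataclasses
import typing


def sketch_datatype_for(t: type) -> str:
    """Map a Python type to a Daft-like dtype description, recursively."""
    origin, args = typing.get_origin(t), typing.get_args(t)
    if t is str:
        return "string"
    if t is int:
        return "int64"
    if t is float:
        return "float64"
    if origin is list:
        # Recurse into the element type of list[T].
        return f"list[{sketch_datatype_for(args[0])}]"
    if origin is dict:
        # Recurse into both the key and value types of dict[K, V].
        key, val = args
        return f"map[{sketch_datatype_for(key)}, {sketch_datatype_for(val)}]"
    if dataclasses.is_dataclass(t):
        # Structured types become structs: one entry per annotated field.
        fields = ", ".join(
            f"{f.name}: {sketch_datatype_for(f.type)}" for f in dataclasses.fields(t)
        )
        return f"struct[{fields}]"
    raise TypeError(f"unsupported type: {t!r}")
```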
Confidence score: 3/5
- This PR has resolved the critical naming issue but still contains several logic problems that could affect functionality
- Score reflects that while the typos are fixed, the underlying issues with nullable handling and datetime mapping remain unresolved
- Pay close attention to the Optional/Union handling logic and datetime type mapping in the main implementation file
2 files reviewed, no comments
And maybe there should be some integration tests here? 🤔 e.g. make a UDF using a Pydantic class, use this function for the
Also, I wasn't sure where to put this function. I figured its own module is good. Thoughts?
I might've missed it but does this PR add the functionality to actually turn a Pydantic BaseModel instance into Daft, or just conversion of a BaseModel type into a Daft DataType? In other words, can you, say, return a BaseModel in a UDF? Otherwise, it's not particularly useful to pass it into `return_dtype`.
No, that's not in here. I'd like to implement that as well. I need some pointers on that. Right now it is translating things into arrow struct types. So, Daft makes that a dictionary at runtime. Where in the code does Daft actually convert the column value into this dict? If I can just propagate the pydantic class type into this spot, then it will be trivial to serialize and deserialize.
On Thu, Sep 4, 2025 at 10:11 PM Kev Wang wrote:
> Also for the `daft_datatype_for` function, I believe if you implement it on `DataType._infer_type`, it should automatically work in `return_dtype`.
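Absent that hook, one workaround for the round-trip the reply above asks about (a sketch, not part of this PR) is to convert explicitly at the UDF boundary: dump the model to a plain dict on the way into Daft, and re-hydrate the dict back into the model class on the way out. The sketch below uses a stdlib dataclass stand-in; with Pydantic v2 the same roles would be played by `model_dump()` and `Model.model_validate()`.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class Page:
    number: int
    text: str


def to_struct_value(obj: Any) -> dict:
    # What a UDF would return so Daft can store the value as a struct column
    # (struct columns materialize as plain dicts at runtime).
    return dataclasses.asdict(obj)


def from_struct_value(cls: type, value: dict) -> Any:
    # Re-hydrate the dict handed back at read time into the original class.
    return cls(**value)
```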
```python
def _daft_datatype_for(f_type: type[Any]) -> DataType:
```
A lot of this looks duplicated from `_infer_dtype`. Is there a way we could consolidate some of this logic?
I think that's doable. But how would it work with the optional pydantic import? I could put in some import-checking code to skip over the pydantic-related parts if pydantic isn't installed.
Also, that's a private function; is there a better place to put a public function? The intention of this PR is to make this functionality public to users. And IIUC users aren't supposed to call `_infer_dtype` directly, right? So how could they use this for their `return_dtype=`?
We have conditional checks in there as well, such as `PIL.Image`:

```python
elif (
    isinstance(user_provided_type, types.ModuleType)
    and hasattr(user_provided_type, "__name__")
    and user_provided_type.__name__ == "PIL.Image"
) or (pil_image.module_available() and user_provided_type is pil_image.Image):
    return DataType.image()
```

So I think it would be reasonable to add conditional checks in here for pydantic types.
As far as it being private: we do implicit conversions, so typically you can use anything that works with `_infer_dtype` without having to explicitly call it. As such, the user shouldn't even need to manually perform the conversion -- just use their model directly.
```python
class PydanticModel(...):
    pass

@daft.func
def my_func(text: str) -> PydanticModel:
    pass

# alternatively
@daft.func(return_dtype=PydanticModel)
def my_func(text):
    pass
```
Ah ok, thanks for pointing this mechanism out! I did not know this is how it worked. I can work on adapting this PR to fit into `_infer_type`.
```python
__all__: Sequence[str] = ("daft_datatype_for",)
```
```python
def daft_datatype_for(f_type: type[Any]) -> DataType:
```
Typically most of our conversion methods are `from_` methods. I'd prefer to keep this consistent with our other datatype conversion functions:

```python
dt = DataType.from_arrow_type(arrow_dtype)
dt = DataType.from_numpy_dtype(np_type)
dt = DataType.from_pydantic(...)
```
Would that pattern work for what I've implemented here?
Specifically, what I made here lets the user avoid saying upfront "this is just for a pydantic data type" or "a list of pydantic datatypes," etc. It looks at the type and then recursively descends its structure to determine what the `daft.DataType` is. That's why the signature is `type[Any] -> daft.DataType`. It's also necessary for the parameter to be a `type[Any]` because it's recursive: we're not guaranteed to only keep getting pydantic BaseModels as input as we descend into the structure.
IIUC, having a `DataType.from_pydantic` wouldn't work for something that:
- is a `list[<pydantic basemodel>]`
- is a `dict[<something>, <pydantic basemodel>]`
- is a pydantic basemodel that has attributes that are `@dataclass`-decorated classes

Or, is there something I'm missing from your suggestion here?
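A related wrinkle in this kind of recursive descent, and one flagged in the review summary earlier, is Optional: since `Optional[X]` is just `Union[X, None]`, the recursion has to unwrap it before dispatching on the inner type. A stand-alone sketch of that unwrapping (not the PR's code):

```python
import typing


def unwrap_optional(t: typing.Any) -> tuple[typing.Any, bool]:
    """Return (inner_type, nullable); Optional[X] is Union[X, None]."""
    # Note: PEP 604 unions written as `X | None` report types.UnionType
    # as their origin on Python 3.10+, so they'd need a second check.
    if typing.get_origin(t) is typing.Union:
        args = typing.get_args(t)
        non_none = [a for a in args if a is not type(None)]
        # Only the two-arm Union[X, None] collapses to a nullable X;
        # wider unions are left for the caller to handle.
        if len(args) == 2 and len(non_none) == 1:
            return non_none[0], True
    return t, False
```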
> It looks at the type and then recursively descends its structure to determine what the daft.DataType is.

This is exactly what the current `_infer_dtype` does.
Ok I didn't know that, thanks! But that's not accessible to users. So we need a new publicly available function for them to use. Do you have a suggestion on where that should go?
For the most part, I expect users to just use their native types without having to convert to a `DataType`, as mentioned here: #5148 (comment).
But for cases where they want to (for whatever reason) convert to a datatype manually, I'd suggest just making `_infer_dtype` public.
Personally I'd like to eventually have a public method for inferring the dtype so users can check the conversion logic without having to construct a UDF or whatnot. But that would mean that we have stabilized/unified our conversion logic from Python to Daft which we have not yet done.
Changes Made
Adds a new `daft.pydantic_integration` module with a new function, `daft_datatype_for`. This function accepts a type and generates the appropriate `daft.DataType` for it. It is intended to be used in Python UDFs with `return_dtype=daft_datatype_for(...)`.
In addition to the basic types (str, int, etc.) and container types (list, dict), this function knows how to handle Pydantic BaseModel classes as well as classes decorated with `@dataclass`. Additionally, this function can handle such classes that have generic parameters that are filled in.
Checklist
- `docs/mkdocs.yml` navigation