
Conversation

Contributor

@srilman srilman commented Aug 29, 2025

Changes Made

With Python-dtype columns, we can't always guarantee that the values in the column are serializable/picklable for sending to another process. So for the time being, let's force these types to run on the same thread.
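The constraint can be seen with a quick sketch using plain `pickle` (Daft may use a different serializer internally, so this is only illustrative; `can_pickle` is a hypothetical helper, not part of the codebase):

```python
import pickle


def can_pickle(value):
    """Return True if the value survives a pickle round-trip."""
    try:
        return pickle.loads(pickle.dumps(value)) is not None
    except Exception:
        return False


# Plain data is fine to ship to another process...
print(can_pickle([1, 2, 3]))        # True
# ...but arbitrary Python objects, such as a lambda, are not picklable,
# so a Python-dtype column holding them cannot cross a process boundary.
print(can_pickle(lambda x: x + 1))  # False
```

This is why thread-based execution (shared memory, no serialization) is the safe default for Python dtypes.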

In the future, we should consider:

  • Enforcing or strongly suggesting this requirement, since it may cause issues in distributed execution
  • Building a method to actually serialize Python arrays (Arrow doesn't support this)
  • Trying the process approach first and falling back to threads if it fails

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the fix label Aug 29, 2025
@srilman srilman changed the base branch from main to slade/split-filter-udfs August 29, 2025 21:50
Base automatically changed from slade/split-filter-udfs to main August 29, 2025 22:54
try:
conn.send((_ERROR, TracebackException.from_exception(e)))
tb = "\n".join(TracebackException.from_exception(e).format())
Copy link
Contributor Author
Contributor Author

@srilman srilman Sep 2, 2025


I have no idea why this is now an issue and wasn't before, but I'm not going to investigate it too much right now since it seems to only break on Python 3.9 and 3.10, and since 3.9 should be EOL in a month, it's probably OK.
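The fallback in the diff above can be expressed as a probe-then-degrade helper (a sketch of the idea only; the probe-first structure and the `serialize_exception` name are mine, not the PR's actual code):

```python
import pickle
from traceback import TracebackException


def serialize_exception(e):
    """Prefer sending the rich TracebackException object; fall back to a
    plain formatted string if it (or something it references) fails to
    pickle, as observed on some Python versions."""
    tb = TracebackException.from_exception(e)
    try:
        pickle.dumps(tb)  # probe picklability before handing it to the pipe
        return tb
    except Exception:
        # Degrade to a string traceback, which is always picklable.
        return "\n".join(tb.format())
```

Either return value can then be sent over the connection alongside the error tag.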

@srilman srilman marked this pull request as ready for review September 2, 2025 18:49
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR implements a safety mechanism to prevent UDFs (User Defined Functions) from using process-based execution when dealing with Python-dtype columns. The core issue addressed is that Python-dtype columns contain arbitrary Python objects (like lambda functions or custom classes) that cannot be guaranteed to be serializable/picklable, which is required for inter-process communication in the UDF execution system.

The changes span multiple layers of the codebase:

Rust Logic Layer: Modified src/daft-logical-plan/src/ops/udf.rs to add validation that prevents setting use_process=True when Python dtypes are detected in either input columns or output projections. This provides early error detection with clear error messages.

Execution Layer: Updated src/daft-local-execution/src/intermediate_ops/udf.rs to automatically detect non-Arrow dtypes (including Python dtypes) using the dtype.is_arrow() method and force thread-based execution as the default behavior. The pipeline creation code in src/daft-local-execution/src/pipeline.rs was updated to pass input schema information to enable this detection.

Python Layer: Enhanced error handling in daft/execution/udf_worker.py to make exception serialization more robust with fallback mechanisms, and fixed a string formatting bug in daft/execution/udf.py.

Documentation: Added clarifying documentation to the is_arrow() method in src/daft-schema/src/dtype.rs to explain its role in determining datatype compatibility with Arrow format.

Testing: Added comprehensive test coverage in tests/expressions/test_legacy_udf.py to verify that UDFs with Python-dtype inputs work correctly with thread-based execution but properly fail with clear error messages when forced to use process-based execution.

This change maintains backward compatibility while preventing runtime serialization failures, with the UDF system automatically falling back to safer thread-based execution for Python dtypes while still allowing process-based execution for Arrow-compatible types when beneficial.
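The dispatch behavior the summary describes can be sketched in miniature (a Python stand-in for the Rust logic; the dtype names, function name, and error text here are illustrative, not Daft's actual API):

```python
def choose_executor(dtypes, use_process=None):
    """Decide thread- vs. process-based execution for a UDF.

    `dtypes` is a list of dtype names; "python" stands in for Daft's
    Python dtype, which Arrow cannot represent, so its values must stay
    in-process to avoid serialization across a process boundary.
    """
    has_python_dtype = any(dt == "python" for dt in dtypes)
    if has_python_dtype:
        if use_process is True:
            # Mirrors the logical-plan validation: fail early, clearly.
            raise ValueError(
                "cannot set use_process=True with a Python-dtype column"
            )
        return "thread"  # forced fallback regardless of defaults
    return "process" if use_process else "thread"
```

The key property is that an explicit `use_process=True` conflicts loudly instead of failing later at pickle time, while the unspecified case silently degrades to threads.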

Confidence score: 4/5

  • This PR addresses a real technical limitation with clear, well-implemented solutions across multiple system layers
  • Score reflects the comprehensive nature of changes affecting both Rust and Python components, requiring careful coordination
  • Pay close attention to the UDF execution logic changes in src/daft-local-execution/src/intermediate_ops/udf.rs and validation in src/daft-logical-plan/src/ops/udf.rs

7 files reviewed, no comments


@@ -88,7 +88,7 @@ def __init__(self, project_expr: PyExpr, passthrough_exprs: list[PyExpr]) -> Non
expr_projection = ExpressionsProjection(
[Expression._from_pyexpr(expr) for expr in passthrough_exprs] + [Expression._from_pyexpr(project_expr)]
)
expr_projection_bytes = daft.pickle.dumps(expr_projection)
expr_projection_bytes = daft.pickle.dumps((project_expr.name(), expr_projection))
Contributor Author

@srilman srilman Sep 2, 2025


I realized this is the wrong name (it's the name of the expression, not the UDF), but since we're in a crunch, I fixed it in the next UDF PR, which I need to merge today to avoid rerunning CI.

Contributor Author


If I need to address any comments, I will modify it then.

// and that use_process != true
#[cfg(feature = "python")]
{
if matches!(udf_properties.use_process, Some(true)) {
Contributor


Let's check for actor pools as well.

@kevinzwang
Copy link
Member

@srilman I'm honestly fine with only allowing UDFs to return serializable values. With our v0.6 version bump this might be a good time to start enforcing this, instead of building a workaround when people return unserializable values. What do you think?

@srilman
Copy link
Contributor Author

srilman commented Sep 2, 2025

@srilman I'm honestly fine with only allowing UDFs to return serializable values. With our v0.6 version bump this might be a good time to start enforcing this, instead of building a workaround when people return unserializable values. What do you think?

I agree, I think so too, but Arrow2 actually can't serialize Python arrays across processes. So if you used an actor-pool UDF with Python inputs / outputs, it wouldn't work. So I decided to just adhere to that behavior and later on build a method to serialize Python arrays. Then on v0.7 we can enforce this.

@srilman srilman requested a review from colin-ho September 2, 2025 22:03
@srilman srilman enabled auto-merge (squash) September 2, 2025 22:03
{
return Err(Error::CreationError {
source: DaftError::InternalError(
format!("UDF `{}` can not set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)
Contributor


Suggested change
format!("UDF `{}` can not set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)
format!("UDF `{}` cannot set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)

Contributor Author


If you don't mind, I'm going to address these two problems in the next PR (the serialize PR) to reduce CI contention on macOS runners.

reason="Ray runner will always run UDFs on separate processes",
)
def test_udf_python_dtype():
"""Test that running a UDF with a Python-dtype output column runs on the same process.
Contributor


This test is checking a Python input column, not an output column. Can you add another test for the output column?
