
Conversation

Contributor

@srilman srilman commented Aug 29, 2025

Changes Made

With Python-dtype columns, we can't always guarantee that the values in the column are serializable/picklable for sending to another process. So for the time being, let's force these types to run on the same thread.
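The constraint can be seen with a quick sketch using plain `pickle` (Daft may use a different serializer internally, so this is only illustrative; `can_pickle` is a hypothetical helper, not part of the codebase):

```python
import pickle


def can_pickle(value):
    """Return True if the value survives a pickle round-trip."""
    try:
        return pickle.loads(pickle.dumps(value)) is not None
    except Exception:
        return False


# Plain data is fine to ship to another process...
print(can_pickle([1, 2, 3]))        # True
# ...but arbitrary Python objects, such as a lambda, are not picklable,
# so a Python-dtype column holding them cannot cross a process boundary.
print(can_pickle(lambda x: x + 1))  # False
```

This is why thread-based execution (shared memory, no serialization) is the safe default for Python dtypes.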

In the future, we should consider:

  • Enforcing or strongly suggesting this requirement, since it may cause issues in distributed execution
  • Building a method to actually serialize Python arrays (Arrow doesn't support this)
  • Trying the process approach first and falling back to threads if it fails

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the fix label Aug 29, 2025
@srilman srilman changed the base branch from main to slade/split-filter-udfs August 29, 2025 21:50
Base automatically changed from slade/split-filter-udfs to main August 29, 2025 22:54
try:
conn.send((_ERROR, TracebackException.from_exception(e)))
tb = "\n".join(TracebackException.from_exception(e).format())
Copy link
Contributor Author
Contributor Author

@srilman srilman Sep 2, 2025


I have no idea why this is now an issue and wasn't before, but I'm not going to investigate it too much right now since it seems to only break on Python 3.9 and 3.10, and since 3.9 should be EOL in a month, it's probably OK.
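The fallback in the diff above can be expressed as a probe-then-degrade helper (a sketch of the idea only; the probe-first structure and the `serialize_exception` name are mine, not the PR's actual code):

```python
import pickle
from traceback import TracebackException


def serialize_exception(e):
    """Prefer sending the rich TracebackException object; fall back to a
    plain formatted string if it (or something it references) fails to
    pickle, as observed on some Python versions."""
    tb = TracebackException.from_exception(e)
    try:
        pickle.dumps(tb)  # probe picklability before handing it to the pipe
        return tb
    except Exception:
        # Degrade to a string traceback, which is always picklable.
        return "\n".join(tb.format())
```

Either return value can then be sent over the connection alongside the error tag.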

@srilman srilman marked this pull request as ready for review September 2, 2025 18:49
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR implements a safety mechanism to prevent UDFs (User Defined Functions) from using process-based execution when dealing with Python-dtype columns. The core issue addressed is that Python-dtype columns contain arbitrary Python objects (like lambda functions or custom classes) that cannot be guaranteed to be serializable/picklable, which is required for inter-process communication in the UDF execution system.

The changes span multiple layers of the codebase:

Rust Logic Layer: Modified src/daft-logical-plan/src/ops/udf.rs to add validation that prevents setting use_process=True when Python dtypes are detected in either input columns or output projections. This provides early error detection with clear error messages.

Execution Layer: Updated src/daft-local-execution/src/intermediate_ops/udf.rs to automatically detect non-Arrow dtypes (including Python dtypes) using the dtype.is_arrow() method and force thread-based execution as the default behavior. The pipeline creation code in src/daft-local-execution/src/pipeline.rs was updated to pass input schema information to enable this detection.

Python Layer: Enhanced error handling in daft/execution/udf_worker.py to make exception serialization more robust with fallback mechanisms, and fixed a string formatting bug in daft/execution/udf.py.

Documentation: Added clarifying documentation to the is_arrow() method in src/daft-schema/src/dtype.rs to explain its role in determining datatype compatibility with Arrow format.

Testing: Added comprehensive test coverage in tests/expressions/test_legacy_udf.py to verify that UDFs with Python-dtype inputs work correctly with thread-based execution but properly fail with clear error messages when forced to use process-based execution.

This change maintains backward compatibility while preventing runtime serialization failures, with the UDF system automatically falling back to safer thread-based execution for Python dtypes while still allowing process-based execution for Arrow-compatible types when beneficial.
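The dispatch behavior the summary describes can be sketched in miniature (a Python stand-in for the Rust logic; the dtype names, function name, and error text here are illustrative, not Daft's actual API):

```python
def choose_executor(dtypes, use_process=None):
    """Decide thread- vs. process-based execution for a UDF.

    `dtypes` is a list of dtype names; "python" stands in for Daft's
    Python dtype, which Arrow cannot represent, so its values must stay
    in-process to avoid serialization across a process boundary.
    """
    has_python_dtype = any(dt == "python" for dt in dtypes)
    if has_python_dtype:
        if use_process is True:
            # Mirrors the logical-plan validation: fail early, clearly.
            raise ValueError(
                "cannot set use_process=True with a Python-dtype column"
            )
        return "thread"  # forced fallback regardless of defaults
    return "process" if use_process else "thread"
```

The key property is that an explicit `use_process=True` conflicts loudly instead of failing later at pickle time, while the unspecified case silently degrades to threads.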

Confidence score: 4/5

  • This PR addresses a real technical limitation with clear, well-implemented solutions across multiple system layers
  • Score reflects the comprehensive nature of changes affecting both Rust and Python components, requiring careful coordination
  • Pay close attention to the UDF execution logic changes in src/daft-local-execution/src/intermediate_ops/udf.rs and validation in src/daft-logical-plan/src/ops/udf.rs

7 files reviewed, no comments


@@ -88,7 +88,7 @@ def __init__(self, project_expr: PyExpr, passthrough_exprs: list[PyExpr]) -> Non
expr_projection = ExpressionsProjection(
[Expression._from_pyexpr(expr) for expr in passthrough_exprs] + [Expression._from_pyexpr(project_expr)]
)
expr_projection_bytes = daft.pickle.dumps(expr_projection)
expr_projection_bytes = daft.pickle.dumps((project_expr.name(), expr_projection))
Contributor Author

@srilman srilman Sep 2, 2025


I realized this is the wrong name (it's the name of the expression, not the UDF), but since we're in a crunch, I fixed it in the next UDF PR, which I need to merge today to avoid rerunning CI.

Contributor Author


If I need to address any comments, I will modify it then.

// and that use_process != true
#[cfg(feature = "python")]
{
if matches!(udf_properties.use_process, Some(true)) {
Contributor


Let's check for actor pools as well.

@kevinzwang
Copy link
Member

@srilman I'm honestly fine with only allowing UDFs to return serializable values. With our v0.6 version bump this might be a good time to start enforcing this, instead of building a workaround when people return unserializable values. What do you think?

@srilman
Copy link
Contributor Author

srilman commented Sep 2, 2025

@srilman I'm honestly fine with only allowing UDFs to return serializable values. With our v0.6 version bump this might be a good time to start enforcing this, instead of building a workaround when people return unserializable values. What do you think?

I agree, I think so too, but Arrow2 actually can't serialize Python arrays across processes. So if you used an actor-pool UDF with Python inputs / outputs, it wouldn't work. So I decided to just adhere to that behavior and later on build a method to serialize Python arrays. Then on v0.7 we can enforce this.

@srilman srilman requested a review from colin-ho September 2, 2025 22:03
@srilman srilman enabled auto-merge (squash) September 2, 2025 22:03
{
return Err(Error::CreationError {
source: DaftError::InternalError(
format!("UDF `{}` can not set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)
Contributor


Suggested change
format!("UDF `{}` can not set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)
format!("UDF `{}` cannot set `use_process=True` because it has a Python-dtype input column `{}`. Please unset `use_process` or cast the input to a non-Python dtype if possible.", udf_properties.name, col.name)

Contributor Author


If you don't mind, I'm going to address these two problems in the next PR (the serialize PR) to reduce CI contention on macOS runners.

reason="Ray runner will always run UDFs on separate processes",
)
def test_udf_python_dtype():
"""Test that running a UDF with a Python-dtype output column runs on the same process.
Contributor


This test is checking a Python input column, not an output column. Can you add another test for the output column?
