Refactor setup.py for lazy loading and build optimization #3024

petrex · 2025-09-18T00:54:20Z

TLDR: Refactor setup.py for lazy loading and build optimization

This pull request refactors the setup.py build process to significantly reduce import overhead and improve build-time performance, especially for users who do not need to build C++/CUDA extensions. The main strategy is to defer heavy imports (like torch and torch.utils.cpp_extension) and submodule checks until they are actually needed at build time, rather than at import time. Additionally, the extension discovery and build logic is restructured for efficiency and maintainability.

Key changes include:

Build-time Import Optimization:

Defers heavy imports (such as torch and torch.utils.cpp_extension) from the top-level module scope to the specific functions or build steps where they are actually required. This reduces unnecessary overhead when running non-build commands or simply importing the package. [1] [2] [3] [4]

Custom Build Extension Logic:

Introduces a new LazyTorchAOBuildExt class that inherits from setuptools's build_ext and dynamically morphs into the real BuildExtension from torch.utils.cpp_extension only when building extensions. This ensures that submodule checks and extension discovery happen only when necessary. [1] [2] [3]
Moves submodule checks and extension discovery from import time to build time, further reducing unnecessary work for non-build operations.
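One way such a "morphing" command class can work is by reassigning `__class__` at run time. The sketch below uses plain stand-in classes (`BaseBuildExt`, `RealBuildExtension`, `LazyBuildExt`, `_load_real_build_ext` are all hypothetical names) instead of setuptools and torch, and may differ from what the PR actually does.

```python
class BaseBuildExt:
    """Stand-in for setuptools' build_ext."""

    def run(self):
        return "base build"


class RealBuildExtension(BaseBuildExt):
    """Stand-in for torch.utils.cpp_extension.BuildExtension."""

    def run(self):
        return "torch-aware build"


def _load_real_build_ext():
    # In a real setup.py the deferred torch import would happen here.
    return RealBuildExtension


class LazyBuildExt(BaseBuildExt):
    """Cheap to construct; becomes the real class only when run() is called."""

    def run(self):
        real_cls = _load_real_build_ext()
        # Morph this instance: later method lookups resolve on the real class.
        self.__class__ = real_cls
        return self.run()
```

Constructing `LazyBuildExt()` costs nothing; the heavy class is loaded only on the first `run()`, which is also where submodule checks and extension discovery could be triggered.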

Extension Discovery and Build Improvements:

Defers extension discovery (get_extensions()) to build time rather than at setup import, and initializes ext_modules as an empty list in the setup() call.
Improves logic for locating the torch CMake directory by using importlib.util.find_spec instead of distutils.sysconfig.get_python_lib, making it more robust and future-proof.
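As an illustration of the `importlib.util.find_spec` approach, the following sketch locates a package directory without importing the package. The function name `find_package_cmake_dir` and the `share/cmake` sub-path are assumptions made for the example, not taken from the PR.

```python
import importlib.util
import os


def find_package_cmake_dir(package_name, subpath=("share", "cmake")):
    """Return <package dir>/share/cmake without importing the package, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or not a package
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, *subpath)
```

For a top-level name such as `"torch"`, `find_spec` only consults the import machinery's finders, so the package itself is not executed.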

Debugging and Verbosity Enhancements:

Adds more granular and conditional debug print statements throughout the build process, controlled by the VERBOSE_BUILD environment variable or debug mode. This helps with diagnosing build issues without cluttering normal output. [1] [2] [3]
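A small sketch of environment-gated debug output. Only the `VERBOSE_BUILD` variable name comes from the description above; the helper name `debug_print`, the accepted values, and the `[build]` prefix are assumptions.

```python
import os
import sys


def debug_print(message, *, debug_mode=False, stream=None):
    """Print `message` only when VERBOSE_BUILD is set or debug mode is on."""
    # Assumed truthy spellings; the PR may accept a different set.
    verbose = os.environ.get("VERBOSE_BUILD", "0").lower() in ("1", "true", "yes")
    if not (verbose or debug_mode):
        return False
    out = sys.stderr if stream is None else stream
    print(f"[build] {message}", file=out)
    return True
```

With the variable unset, normal builds stay quiet; setting `VERBOSE_BUILD=1` turns the same calls into diagnostics.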

Package Discovery and Metadata:

Changes find_packages to only include torchao* packages, making package discovery more precise.

These changes together make the build process faster and less error-prone for users who do not need to build extensions, while retaining full build functionality for those who do.

- Introduced lazy imports for heavy dependencies like `torch` and `torch.utils.cpp_extension` to reduce initial import overhead.
- Replaced the existing `TorchAOBuildExt` class with `LazyTorchAOBuildExt` to defer submodule checks and extension discovery until build time.
- Updated the `setup()` function to set `ext_modules` to an empty list, deferring extension discovery for performance improvements.
- Enhanced debug output for build processes based on environment variables.

This refactor aims to streamline the build process and improve performance during package setup.

pytorch-bot · 2025-09-18T00:54:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3024

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit d0e86c2 with merge base 9e5059e:

NEW FAILURES - The following jobs have failed:

Code Analysis with Ruff / build (3.9) (gh)
Process completed with exit code 1.
Run Regression Tests / test (CUDA 2.6, linux.g5.12xlarge.nvidia.gpu, torch==2.6.0, cuda, 12.6) / linux-job (gh)
test/test_ops.py::TestOps::test_quant_llm_linear_ebits_3_mbits_2_float16
Run Regression Tests / test (CUDA 2.7, linux.g5.12xlarge.nvidia.gpu, torch==2.7.0, cuda, 12.6) / linux-job (gh)
test/test_ops.py::TestOps::test_quant_llm_linear_ebits_3_mbits_2_float16
Run Regression Tests / test (CUDA 2.8, linux.g5.12xlarge.nvidia.gpu, torch==2.8.0, cuda, 12.6) / linux-job (gh)
test/test_ops.py::TestOps::test_quant_llm_linear_ebits_3_mbits_2_float16
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
test/test_ops.py::TestOps::test_quant_llm_linear_ebits_3_mbits_2_float16

This comment was automatically generated by Dr. CI and updates every 15 minutes.

- Added blank lines for improved separation of code blocks.
- Reformatted list comprehensions for better clarity.
- Adjusted line breaks in function calls to enhance readability.

These changes aim to make the code more maintainable and easier to navigate.

msaroufim · 2025-09-18T01:04:36Z

@petrex what kinds of build time speedups are you seeing? The import time for ao is indeed insane and our setup.py has gotten really complicated in the last year and could use some love

petrex · 2025-09-18T03:48:41Z

In a quick test in my environment, I am seeing a 0.5–1.5 s faster cold start.
This comes from:

Avoided torch import
Avoided recursive source/glob scans
Avoided submodule checks unless actually building.

Build time itself is unchanged; only pre-build overhead is reduced.

petrex · 2025-09-18T03:49:23Z

Looks like I accidentally broke some other things; let me look into that later this week.

…kflow

- Introduced the `PIP_NO_BUILD_ISOLATION` environment variable in the `build_wheels_linux.yml` workflow to ensure that the `setuptools` installed in the `pre_build_script.sh` is accessible during the build process.

This change aims to improve the build process by allowing the use of the correct version of `setuptools` without isolation issues.

- Removed the `PIP_NO_BUILD_ISOLATION` environment variable from the `build_wheels_linux.yml` workflow.
- Added the `PIP_NO_BUILD_ISOLATION` export to the `env_var_script_linux.sh` to ensure pre-installed tools are accessible during the build process.

These changes aim to streamline the build environment and maintain consistency in the usage of environment variables.

- Added a check to ensure that auditwheel is only executed if the wheel contains at least one shared object (.so) file.
- Included a message to indicate when auditwheel is skipped due to the absence of shared libraries.
- Updated the wheel removal command to use `rm -f` for safer deletion.

These changes improve the robustness of the post-build process by preventing unnecessary execution of auditwheel.
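The check described in this commit lives in a post-build shell script; the same idea can be sketched in Python as follows. The function name `wheel_has_shared_objects` is hypothetical, and the sketch relies only on the fact that a wheel is a zip archive.

```python
import zipfile


def wheel_has_shared_objects(wheel_path):
    """Return True if the wheel archive contains at least one .so file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return any(name.endswith(".so") for name in wheel.namelist())
```

A caller would run auditwheel only when this returns True and print a "skipped" message otherwise.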

- Updated the script to determine the original wheel file produced by the build process, ensuring that the correct wheel is used for auditwheel operations.
- Changed the wheel installation command to select the most recent wheel file from the distribution directory.
- Enhanced logging messages to reflect the changes in wheel handling.

These modifications enhance the reliability and clarity of the post-build process.
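Selecting the most recent wheel can be sketched in Python as below. The actual change is in a shell script; the function name `newest_wheel` and the use of modification time as the ordering are assumptions.

```python
import glob
import os


def newest_wheel(dist_dir):
    """Return the most recently modified .whl in `dist_dir`, or None if empty."""
    wheels = glob.glob(os.path.join(dist_dir, "*.whl"))
    if not wheels:
        return None
    return max(wheels, key=os.path.getmtime)
```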

petrex · 2025-09-18T18:55:09Z

hey @msaroufim

I'm seeing this in CI; is that something you could help with?
torchao::unpack_tensor_core_tiled_layout' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.

- Updated the logic to skip CUDA extension compilation only if both CUDA_HOME is unset and nvcc is not found, improving compatibility with CI environments.
- Adjusted the condition for building CUDA extensions to check for the presence of either CUDA_HOME or nvcc.

These changes aim to provide clearer messaging and better support for CUDA extension compilation in various environments.
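The condition in this commit can be sketched as follows. The function name `should_build_cuda_extensions` and its injectable parameters are hypothetical and exist only to make the logic easy to exercise; the PR's own code may be structured differently.

```python
import os
import shutil


def should_build_cuda_extensions(environ=None, which=shutil.which):
    """Build CUDA extensions if CUDA_HOME is set or nvcc is on PATH."""
    environ = os.environ if environ is None else environ
    has_cuda_home = bool(environ.get("CUDA_HOME"))
    has_nvcc = which("nvcc") is not None
    # Skip only when both signals are missing.
    return has_cuda_home or has_nvcc
```

This means a CI image that ships nvcc on PATH but never exports CUDA_HOME still gets its CUDA kernels compiled.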

meta-cla bot added the CLA Signed label Sep 18, 2025

petrex added the topic: not user facing and enhancement labels Sep 18, 2025

jerryzh168 requested review from vkuzo and msaroufim September 18, 2025 00:56

petrex added 3 commits September 17, 2025 21:01

petrex self-assigned this Sep 18, 2025

petrex mentioned this pull request Sep 19, 2025

Optimize Load Time #3009

Closed

petrex added 2 commits September 22, 2025 21:31

lint

d0e86c2
