Fix TPU test CI #14926

carmocca · 2022-09-28T18:55:32Z

What does this PR do?

Properly set the PR number
Clean up logs
Make fetching faster

Part of #14550

Does your PR introduce any breaking changes? If yes, please list them.

None

cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7

awaelchli

You added Lite 🤩

awaelchli · 2022-09-28T19:55:45Z

I need to look into the pickle issue ... We can backtrack changes we made to the multiprocess launcher recently, hmmm.

src/pytorch_lightning/strategies/launchers/xla.py

dockers/tpu-tests/tpu_test_cases.jsonnet

carmocca · 2022-09-30T14:03:13Z

Debugging recap:

mkl==2021.4.0, the version we currently use in CI, does ship the .so:

find / -name 'libmkl*' | grep libmkl_intel_lp64.so.1
/root/miniconda3/envs/lightning/lib/libmkl_intel_lp64.so.1

and python -c "import torch" works. Yet the OSError still appears in the spawn test:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/lightning/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/root/miniconda3/envs/lightning/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch/__init__.py", line 201, in <module>
    _load_global_deps()
  File "/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch/__init__.py", line 154, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/root/miniconda3/envs/lightning/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmkl_intel_lp64.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/lightning/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/root/miniconda3/envs/lightning/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch/__init__.py", line 201, in <module>
    _load_global_deps()
  File "/root/miniconda3/envs/lightning/lib/python3.7/site-packages/torch/__init__.py", line 154, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/root/miniconda3/envs/lightning/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmkl_intel_lp64.so.1: cannot open shared object file: No such file or directory

The latest mkl doesn't even provide the shared library (find doesn't show it), so just running python -c "import torch" triggers the OSError.

So I don't think we can change the mkl version.

torch_xla installs mkl-include in their CI: https://github.com/pytorch/xla/blob/3eddd035a1d6270b6b07902ffeb84f2af486196e/scripts/build_torch_wheels.sh#L182 but I didn't see a difference when doing the same.

Worst case we just merge this with the skipped TPU test.
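
The comparison in the recap (the library loads in a plain interpreter but not in the spawned process) can be scripted. The following is a minimal diagnostic sketch, not something taken from the PR: `can_dlopen` and `can_dlopen_in_child` are hypothetical helper names, and the child here is a fresh `python -c` interpreter rather than the actual `multiprocessing` spawn path, so it only approximates the failing test.

```python
import ctypes
import subprocess
import sys
from typing import Mapping, Optional


def can_dlopen(lib_name: str) -> bool:
    """Return True if the dynamic loader can open `lib_name` in this process."""
    try:
        ctypes.CDLL(lib_name, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return False
    return True


def can_dlopen_in_child(lib_name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Repeat the check in a fresh interpreter.

    With env=None the child inherits the current environment; pass a dict to
    test a different one (for example, a changed LD_LIBRARY_PATH).
    """
    code = "import ctypes, sys; ctypes.CDLL(sys.argv[1], mode=ctypes.RTLD_GLOBAL)"
    result = subprocess.run(
        [sys.executable, "-c", code, lib_name],
        env=env,
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    name = "libmkl_intel_lp64.so.1"
    print("parent:", can_dlopen(name))
    print("child: ", can_dlopen_in_child(name))
```

If the two results differed inside the CI image, that would suggest the loader's search path changes between processes rather than a problem with the contents of the mkl package.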

awaelchli · 2022-09-30T15:27:54Z

@carmocca I honestly think skipping that one test that is provoking this error is totally fine, as it is not very critical. Maybe one more thing one could do is clone the branch into Colab and run some tests there just as a sanity check manually. But I wouldn't spend too much time on it.

carmocca · 2022-09-30T15:36:30Z

It worries me a bit because the test just does a fast-dev-run that trains and tests a BoringModel, which is the simplest integration test we could have. We might have to skip more tests; I just don't know yet because getting even one run to happen is really difficult given the TPU availability.

carmocca · 2022-10-01T16:02:45Z

I've triggered the TPU tests 10 times and could not get a machine even once. I propose we merge this as it is to at least fix the launching code and continue fixing tests separately.

* Fix TPU test CI
* +x first
* Lite first to uncovert errors faster
* Fixes
* One more
* Simplify XLALauncher wrapping to avoid pickle error
* debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Debug commit successful. Trying local definitions
* Require tpu for mock test
* ValueError: The number of devices must be either 1 or 8, got 4 instead
* Fix mock test
* Simplify call, rely on defaults
* Skip OSError for now. Maybe upgrading will help
* Simplify launch tests, move some to lite
* Stricter typing
* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.
* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."
  This reverts commit f65107e.
* Alternative boring solution to the reverted commit
* Fix failing test on CUDA machine
* Workarounds
* Try latest mkl
* Revert "Try latest mkl"
  This reverts commit d06813a.
* Wrong exception
* xfail
* Mypy
* Comment change
* Spawn launch refactor
* Accept that we cannot lazy init now
* Fix mypy and launch test failures
* The base dockerfile already includes mkl-2022.1.0 - what if we use it?
* try a different mkl version
* Revert mkl version changes

Co-authored-by: awaelchli <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>

* use more recent lightning cloud launcher
* allow LightningApp to use custom cloud compute for flows
* feedback from adrian
* adjust other cloud tests
* update
* update
* update commens
* Update src/lightning_app/core/app.py
  Co-authored-by: Sherin Thomas <[email protected]>
* Close profiler when `StopIteration` is raised (#14945)
* Find last checkpoints on restart (#14907)
  Co-authored-by: Carlos Mocholí <[email protected]>
* Remove unused gcsfs dependency (#14962)
* Update hpu mixed precision link (#14974)
  Signed-off-by: Jerome <[email protected]>
* Bump version of fsspec (#14975)
  fsspec verbump
* Fix TPU test CI (#14926)
* Fix TPU test CI
* +x first
* Lite first to uncovert errors faster
* Fixes
* One more
* Simplify XLALauncher wrapping to avoid pickle error
* debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Debug commit successful. Trying local definitions
* Require tpu for mock test
* ValueError: The number of devices must be either 1 or 8, got 4 instead
* Fix mock test
* Simplify call, rely on defaults
* Skip OSError for now. Maybe upgrading will help
* Simplify launch tests, move some to lite
* Stricter typing
* RuntimeError: Accessing the XLA device before processes have spawned is not allowed.
* Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed."
  This reverts commit f65107e.
* Alternative boring solution to the reverted commit
* Fix failing test on CUDA machine
* Workarounds
* Try latest mkl
* Revert "Try latest mkl"
  This reverts commit d06813a.
* Wrong exception
* xfail
* Mypy
* Comment change
* Spawn launch refactor
* Accept that we cannot lazy init now
* Fix mypy and launch test failures
* The base dockerfile already includes mkl-2022.1.0 - what if we use it?
* try a different mkl version
* Revert mkl version changes
  Co-authored-by: awaelchli <[email protected]>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Co-authored-by: Akihiro Nitta <[email protected]>
* Trainer: fix support for non-distributed PyTorch (#14971)
* Trainer: fix non-distributed use
* Update CHANGELOG
* fixes typing errors in rich_progress.py (#14963)
* revert default cloud compute rename
* allow LightningApp to use custom cloud compute for flows
* feedback from adrian
* update
* resolve merge with master conflict
* remove preemptible
* update CHANGELOG
* add basic flow cloud compute documentation
* fix docs build
* add missing symlink
* try to fix sphinx
* another attempt for docs
* fix new test

Signed-off-by: Jerome <[email protected]>
Co-authored-by: thomas chaton <[email protected]>
Co-authored-by: Sherin Thomas <[email protected]>
Co-authored-by: Ziyad Sheebaelhamd <[email protected]>
Co-authored-by: otaj <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Jerome Anand <[email protected]>
Co-authored-by: awaelchli <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Adam J. Stewart <[email protected]>
Co-authored-by: DP <[email protected]>

Fix TPU test CI

b42a40e

carmocca added ci Continuous Integration accelerator: tpu Tensor Processing Unit labels Sep 28, 2022

carmocca added this to the pl:1.8 milestone Sep 28, 2022

carmocca self-assigned this Sep 28, 2022

carmocca requested review from Borda, akihironitta and tchaton as code owners September 28, 2022 18:55

carmocca added 3 commits September 28, 2022 20:58

+x first

ebf46cd

Lite first to uncovert errors faster

5627c8b

Fixes

05627a2

carmocca requested review from williamFalcon, awaelchli, kaushikb11 and rohitgr7 as code owners September 28, 2022 19:14

github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 28, 2022

One more

9cdb886

awaelchli approved these changes Sep 28, 2022

Simplify XLALauncher wrapping to avoid pickle error

684e257

carmocca requested a review from justusschock as a code owner September 28, 2022 20:04

carmocca commented Sep 28, 2022

src/pytorch_lightning/strategies/launchers/xla.py (outdated)

awaelchli and others added 5 commits September 28, 2022 23:14

debug

dc7dad8

[pre-commit.ci] auto fixes from pre-commit.com hooks

fc0bd77

for more information, see https://pre-commit.ci

Debug commit successful. Trying local definitions

cbcf684

Require tpu for mock test

2dd01bb

ValueError: The number of devices must be either 1 or 8, got 4 instead

d37217d

carmocca force-pushed the ci/fix-circleci branch from 2356a1f to d37217d Compare September 28, 2022 22:30

carmocca added 2 commits September 29, 2022 01:23

Fix mock test

a6d4505

Simplify call, rely on defaults

cefbeb5

carmocca added 3 commits September 30, 2022 00:56

Merge branch 'master' into ci/fix-circleci

6977f91

Fix mypy and launch test failures

a7d870a

The base dockerfile already includes mkl-2022.1.0 - what if we use it?

5ba7307

carmocca mentioned this pull request Sep 30, 2022

Refactor launching tests to use our launchers #14954

Merged

Merge branch 'master' into ci/fix-circleci

ca8748c

awaelchli mentioned this pull request Sep 30, 2022

More tests for TPU accelerator in Lite #14960

Merged

11 tasks

Merge branch 'master' into ci/fix-circleci

f1ad0d6

carmocca commented Sep 30, 2022

dockers/tpu-tests/tpu_test_cases.jsonnet (outdated)

carmocca force-pushed the ci/fix-circleci branch from e2c0533 to 08c1951 Compare September 30, 2022 11:51

try a different mkl version

4f5af35

carmocca force-pushed the ci/fix-circleci branch from 08c1951 to 4f5af35 Compare September 30, 2022 12:09

Revert mkl version changes

d095bcd

carmocca mentioned this pull request Sep 30, 2022

Migrate TPU tests to GitHub actions #14687

Merged

carmocca marked this pull request as ready for review September 30, 2022 19:04

carmocca requested a review from otaj as a code owner September 30, 2022 19:04

Merge branch 'master' into ci/fix-circleci

21aab4b

awaelchli approved these changes Oct 1, 2022

otaj approved these changes Oct 3, 2022

mergify bot added the ready PRs ready to be merged label Oct 3, 2022

lexierule merged commit 3028fd2 into master Oct 3, 2022

lexierule deleted the ci/fix-circleci branch October 3, 2022 13:13
