
Conversation

Contributor

@mengfei25 mengfei25 commented Sep 1, 2025

  1. enable tests in a container
  2. use local Python instead of Conda
  3. enable pytest parallel runs and continue-if-crash
  4. use pytest-xdist to parallelize tests on an 8-card system instead of pytest-shard (see the sketch below)
  5. run all tests on the rolling driver
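
For item 4, a minimal sketch of how a per-card pytest-xdist run could be wired up, assuming one run is pinned to each card via ZE_AFFINITY_MASK (the step name, worker count, and test path below are illustrative assumptions, not the exact workflow):

- name: Run tests in parallel across 8 cards (illustrative sketch)
  run: |
    # One pytest-xdist run per card; ZE_AFFINITY_MASK pins each run to a single card.
    # Worker count (-n 4) and test path are assumptions.
    for card in 0 1 2 3 4 5 6 7; do
      ZE_AFFINITY_MASK=${card} python3 -m pytest -n 4 --dist=loadfile tests/ &
    done
    wait  # wait for all per-card runs to finish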

test accelerate and transformers only
disable_build
disable_ut
disable_distributed

@mengfei25 mengfei25 requested a review from dvrogozh September 1, 2025 07:42
@chuanqi129
Contributor

Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering two aspects:

  1. Containerized build and test, please refer to [CI] Refactor CICD test workflows #1862
  2. Parallelized UT tests using pytest-xdist, please refer to [CI] Enable pytest parallel run #1966

These changes bring several benefits:

  • Standardized build and tests
  • Reduced time cost for a single UT test job
  • No Conda dependency: setup-python is used directly in containers, so the test env is fully isolated (see the sketch below)
  • Simplified runner maintenance: no need to split one node into multiple runners
  • No change to matrix test support

Please help review this PR. After it lands, we'll remove all single-card runners with the linux.idc.xpu label.
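
A rough sketch of the containerized job shape these points describe, assuming actions/setup-python replaces the previous Conda environment (job name, Python version, options, and test path are illustrative; the image tag is the one discussed later in this PR):

ut-test:
  needs: prepare
  runs-on: ${{ needs.prepare.outputs.runner_id }}
  container:
    image: mengfeili/intel-pvc-driver:1146-1136
    options: --device=/dev/dri --group-add video
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Install test deps
      run: pip install pytest pytest-xdist
    - name: Run UT
      run: python3 -m pytest -n 8 tests/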

@chuanqi129
Contributor

By the way, the accelerate test has 3 failures. I guess they are not related to the changes in this PR.

@chuanqi129
Contributor

@mengfei25 please check the transformers test; there are actually no test cases being executed.

@chuanqi129 chuanqi129 self-requested a review September 2, 2025 15:12
Contributor

@dvrogozh dvrogozh left a comment

Transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.

Important. Be extremely careful with parallelizing transformers tests. In the currently executing version, parallelism of the tests is essentially switched off (see the max parallel jobs setting set to 1). That was done because as soon as we try to parallelize, we run into networking issues either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, and max parallel jobs was set to 1. The matrix is still used to break the whole test into smaller chunks, each around ~30 minutes or less. This allows rerunning a smaller portion of the test on a failure instead of rerunning the whole suite.
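
For reference, the pattern being described is roughly the following, assuming a matrix of test chunks with parallelism turned off (the chunk names are placeholders):

strategy:
  fail-fast: false
  max-parallel: 1   # parallelism effectively switched off due to past networking issues
  matrix:
    test: [chunk_1, chunk_2, chunk_3]   # suite split into ~30 min chunks; only a failed chunk is rerun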

steps:
  - name: Cleanup workspace
    run: |
      sudo find ./ | grep -v "^\./$" | xargs sudo rm -rf
Contributor

This looks weird. Basically, it's a hack, since GitHub Actions is supposed to take care of this entirely. Effectively, you increase the fetch time for the repos, since it seems you are deleting everything each time. It looks like there is some mess with access rights. Can this be fixed?

Contributor Author

  1. Our CI has a potential permission issue with the WORKSPACE
  2. A clean WORKSPACE is important to make sure CI results are correct
  3. The source code is lightweight, so cloning takes very little time

Contributor

See the proposal for a resolution here: https://github.com/intel/torch-xpu-ops/pull/1999/files#r2330443890. We need to discuss whether this can be done within this PR or later on.

Contributor Author

@mengfei25 mengfei25 Sep 9, 2025

Removed. We will manually remove permission-restricted files on the host if we run into this again in the future.

@dvrogozh
Contributor

dvrogozh commented Sep 2, 2025

> By the way, the accelerate test has 3 failures. I guess they are not related to the changes in this PR.

These seem to be new failures. I see some test failures in this weekly run as well, while the weekly run from a week ago was clean.

@mengfei25 mengfei25 requested a review from dvrogozh September 3, 2025 09:59
Contributor

@dvrogozh dvrogozh left a comment

@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the tests. I am impressed by the reduction in transformers test execution time. I was afraid that we would again run into the networking issue, but the test results indicate that we did not.

However, I have two requests before merging this PR:

  1. There are still a few unanswered questions from me in this PR. Could you please go through them and reply?
  2. I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the tests a few times to verify that we don't run into the problem. We can do that once we finish the discussion on the questions that are not yet closed.

runs-on: ${{ needs.prepare.outputs.runner_id }}
needs: prepare
container:
  image: mengfeili/intel-pvc-driver:1146-1136
Contributor

Is using an image under a user account (mengfeili) by design, or is it a leftover from debugging?

Contributor Author

There is no clean public image with only the driver, so I'm using my own image temporarily. We'll consider pushing such an image to an official hub later.

Contributor Author

@mengfei25 mengfei25 Sep 8, 2025

@dvrogozh I've replied to all of your questions, please help review.

@mengfei25
Contributor Author

@dvrogozh Please help review it

@intel intel deleted a comment from chuanqi129 Sep 8, 2025
@mengfei25 mengfei25 requested a review from dvrogozh September 8, 2025 06:42
image: mengfeili/intel-pvc-driver:1146-1136
volumes:
  - ${{ github.workspace }}:${{ github.workspace }}
options: --device=/dev/mem --device=/dev/dri --group-add video --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=8g
Contributor

The groups owning the devices might differ depending on the OS. For modern Ubuntu, that's video owning /dev/dri/cardN and render owning /dev/dri/renderDN. In our workflows we don't typically use /dev/dri/cardN and rely on /dev/dri/renderDN instead. This means it's important to allow access to the render group in the first place, but you have video instead. I suspect that render is hiding under needs.prepare.outputs.render_id. Is that right?

I think this part of the CI setup needs to be polished. I suggest the following (see the sketch after this list):

  1. Pass render and video explicitly here, i.e. options: --group-add render --group-add video ...
  2. -u user:group is special. It's the key to managing permissions and answers the question above (why you need sudo rm -rf to clean the workspace). To fix that, I suggest passing a user:group that matches the credentials of the user the runner runs under. That should fix the file permission issues. In addition, you explicitly pass the extra groups needed to access the devices.
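
A sketch of what the suggested container options could look like, assuming the runner user's uid and gid are exposed by the prepare job (runner_uid and runner_gid are hypothetical output names used only for illustration):

container:
  image: mengfeili/intel-pvc-driver:1146-1136
  options: >-
    -u ${{ needs.prepare.outputs.runner_uid }}:${{ needs.prepare.outputs.runner_gid }}
    --group-add render --group-add video
    --device=/dev/dri --shm-size=8g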

Contributor Author

  1. Good point, fixed.
  2. The container already runs as the runner user; removed the rm before checkout.

@mengfei25 mengfei25 requested a review from dvrogozh September 9, 2025 06:56
Contributor

@dvrogozh dvrogozh left a comment

I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden, which require reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.

cd transformers
python3 -m pytest -rsf --make-reports=$TEST_CASE --junit-xml=reports/$TEST_CASE.xml \
  -k "${{ matrix.test.filter }}" ${{ matrix.test.cmd }} || true
xpu_list=($(echo ${ZE_AFFINITY_MASK} | sed 's/,/ /g'))
Contributor

These changes did not exist just yesterday. Why do we need them? I thought we had agreed that the previous version was almost ready to go. I definitely have concerns with this new version. For example, the output table became much bigger and more complex: https://github.com/intel/torch-xpu-ops/actions/runs/17613758986?pr=1999. We don't need all these categories...

if(gpu==1 && $0~/Platform/){gpu=0}; if(gpu==1){print $0}; if($0~/Platform.*Graphics/){gpu=1}
}' | wc -l)"
cpus_per_xpu="$(echo | awk -v c="${cpu_num}" -v x="${xpu_num}" '{printf c/x}')"
# available gpu card number
Contributor

These changes were not here yesterday. Are they absolutely necessary? I also see that a few other workloads besides accelerate and transformers were changed. Is this necessary as well?
