
Conversation

Contributor

@mengfei25 mengfei25 commented Sep 1, 2025

  1. enable tests in a container
  2. use local Python instead of Conda
  3. enable pytest parallel runs and continue-if-crash
  4. use pytest-xdist to parallelize tests on an 8-card system instead of pytest-shard (see the sketch below)
  5. run all tests on the rolling driver
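
For item 4, a minimal sketch of how a per-card pytest-xdist run could be wired up, assuming one run is pinned to each card via ZE_AFFINITY_MASK (the step name, worker count, and test path below are illustrative assumptions, not the exact workflow):

- name: Run tests in parallel across 8 cards (illustrative sketch)
  run: |
    # One pytest-xdist run per card; ZE_AFFINITY_MASK pins each run to a single card.
    # Worker count (-n 4) and test path are assumptions.
    for card in 0 1 2 3 4 5 6 7; do
      ZE_AFFINITY_MASK=${card} python3 -m pytest -n 4 --dist=loadfile tests/ &
    done
    wait  # wait for all per-card runs to finish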

test accelerate and transformers only
disable_build
disable_ut
disable_distributed

@mengfei25 mengfei25 requested a review from dvrogozh September 1, 2025 07:42
@chuanqi129
Contributor

Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering two aspects:

  1. Containerized build and test, please refer to [CI] Refactor CICD test workflows #1862
  2. Parallelized UT tests using pytest-xdist, please refer to [CI] Enable pytest parallel run #1966

These changes bring several benefits:

  • Standardized build and tests
  • Reduced time cost for a single UT test job
  • No Conda dependency: setup-python is used directly in containers, so the test env is fully isolated (see the sketch below)
  • Simplified runner maintenance: no need to split one node into multiple runners
  • No change to matrix test support

Please help review this PR. After it lands, we'll remove all single-card runners with the linux.idc.xpu label.
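
A rough sketch of the containerized job shape these points describe, assuming actions/setup-python replaces the previous Conda environment (job name, Python version, options, and test path are illustrative; the image tag is the one discussed later in this PR):

ut-test:
  needs: prepare
  runs-on: ${{ needs.prepare.outputs.runner_id }}
  container:
    image: mengfeili/intel-pvc-driver:1146-1136
    options: --device=/dev/dri --group-add video
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.10'
    - name: Install test deps
      run: pip install pytest pytest-xdist
    - name: Run UT
      run: python3 -m pytest -n 8 tests/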

@chuanqi129
Contributor

By the way, the accelerate test has 3 failures. I guess they are not related to the changes in this PR.

@chuanqi129
Contributor

@mengfei25 please check the transformers test; there are actually no test cases being executed.

@chuanqi129 chuanqi129 self-requested a review September 2, 2025 15:12
Contributor

@dvrogozh dvrogozh left a comment

Transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.

Important. Be extremely careful with parallelizing transformers tests. In the currently executing version, parallelism of the tests is essentially switched off (see the max parallel jobs setting set to 1). That was done because as soon as we try to parallelize, we run into networking issues either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, and max parallel jobs was set to 1. The matrix is still used to break the whole test into smaller chunks, each around ~30 minutes or less. This allows rerunning a smaller portion of the test on a failure instead of rerunning the whole suite.
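
For reference, the pattern being described is roughly the following, assuming a matrix of test chunks with parallelism turned off (the chunk names are placeholders):

strategy:
  fail-fast: false
  max-parallel: 1   # parallelism effectively switched off due to past networking issues
  matrix:
    test: [chunk_1, chunk_2, chunk_3]   # suite split into ~30 min chunks; only a failed chunk is rerun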

steps:
  - name: Cleanup workspace
    run: |
      sudo find ./ | grep -v "^\./$" | xargs sudo rm -rf
Contributor

This looks weird. Basically, it's a hack, since GitHub Actions is supposed to take care of this entirely. Effectively, you increase the fetch time for the repos, since it seems you are deleting everything each time. It looks like there is some mess with access rights. Can this be fixed?

Contributor Author

  1. Our CI has a potential permission issue with the WORKSPACE
  2. A clean WORKSPACE is important to make sure CI results are correct
  3. The source code is lightweight, so cloning takes very little time

Contributor

See the proposal for a resolution here: https://github.com/intel/torch-xpu-ops/pull/1999/files#r2330443890. We need to discuss whether this can be done within this PR or later on.

Contributor Author

@mengfei25 mengfei25 Sep 9, 2025

Removed. We will manually remove permission-restricted files on the host if we run into this again in the future.

@dvrogozh
Contributor

dvrogozh commented Sep 2, 2025

> By the way, the accelerate test has 3 failures. I guess they are not related to the changes in this PR.

These seem to be new failures. I see some test failures in this weekly run as well, while the weekly run from a week ago was clean.

@mengfei25 mengfei25 requested a review from dvrogozh September 3, 2025 09:59
Contributor

@dvrogozh dvrogozh left a comment

@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the tests. I am impressed by the reduction in transformers test execution time. I was afraid that we would again run into the networking issue, but the test results indicate that we did not.

However, I have two requests before merging this PR:

  1. There are still a few unanswered questions from me in this PR. Could you please go through them and reply?
  2. I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the tests a few times to verify that we don't run into the problem. We can do that once we finish the discussion on the questions that are not yet closed.

runs-on: ${{ needs.prepare.outputs.runner_id }}
needs: prepare
container:
  image: mengfeili/intel-pvc-driver:1146-1136
Contributor

Is using an image under a user account (mengfeili) by design, or is it a leftover from debugging?

Contributor Author

There is no clean public image with only the driver, so I'm using my own image temporarily. We'll consider pushing such an image to an official hub later.

Contributor Author

@mengfei25 mengfei25 Sep 8, 2025

@dvrogozh I've replied to all of your questions, please help review.

@mengfei25
Contributor Author

@dvrogozh Please help review it

@intel intel deleted a comment from chuanqi129 Sep 8, 2025
@mengfei25 mengfei25 requested a review from dvrogozh September 8, 2025 06:42
image: mengfeili/intel-pvc-driver:1146-1136
volumes:
  - ${{ github.workspace }}:${{ github.workspace }}
options: --device=/dev/mem --device=/dev/dri --group-add video --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=8g
Contributor

The groups owning the devices might differ depending on the OS. For modern Ubuntu, that's video owning /dev/dri/cardN and render owning /dev/dri/renderDN. In our workflows we don't typically use /dev/dri/cardN and rely on /dev/dri/renderDN instead. This means it's important to allow access to the render group in the first place, but you have video instead. I suspect that render is hiding under needs.prepare.outputs.render_id. Is that right?

I think this part of the CI setup needs to be polished. I suggest the following (see the sketch after this list):

  1. Pass render and video explicitly here, i.e. options: --group-add render --group-add video ...
  2. -u user:group is special. It's the key to managing permissions and answers the question above (why you need sudo rm -rf to clean the workspace). To fix that, I suggest passing a user:group that matches the credentials of the user the runner runs under. That should fix the file permission issues. In addition, you explicitly pass the extra groups needed to access the devices.
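
A sketch of what the suggested container options could look like, assuming the runner user's uid and gid are exposed by the prepare job (runner_uid and runner_gid are hypothetical output names used only for illustration):

container:
  image: mengfeili/intel-pvc-driver:1146-1136
  options: >-
    -u ${{ needs.prepare.outputs.runner_uid }}:${{ needs.prepare.outputs.runner_gid }}
    --group-add render --group-add video
    --device=/dev/dri --shm-size=8g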

Contributor Author

  1. Good point, fixed.
  2. The container already runs as the runner user; removed the rm before checkout.

@mengfei25 mengfei25 requested a review from dvrogozh September 9, 2025 06:56
Contributor

@dvrogozh dvrogozh left a comment

I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden, which require reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.

cd transformers
python3 -m pytest -rsf --make-reports=$TEST_CASE --junit-xml=reports/$TEST_CASE.xml \
  -k "${{ matrix.test.filter }}" ${{ matrix.test.cmd }} || true
xpu_list=($(echo ${ZE_AFFINITY_MASK} | sed 's/,/ /g'))
Contributor

These changes did not exist just yesterday. Why do we need them? I thought we had agreed that the previous version was almost ready to go. I definitely have concerns with this new version. For example, the output table became much bigger and more complex: https://github.com/intel/torch-xpu-ops/actions/runs/17613758986?pr=1999. We don't need all these categories...

if(gpu==1 && $0~/Platform/){gpu=0}; if(gpu==1){print $0}; if($0~/Platform.*Graphics/){gpu=1}
}' | wc -l)"
cpus_per_xpu="$(echo | awk -v c="${cpu_num}" -v x="${xpu_num}" '{printf c/x}')"
# available gpu card number
Contributor

These changes were not here yesterday. Are they absolutely necessary? I also see that a few other workloads besides accelerate and transformers were changed. Is this necessary as well?
