[CI] Modify accelerate and transformers tests #1999
Conversation
…l/torch-xpu-ops into mengfeil/modify-extra-tests
Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering 2 aspects:
These changes have a lot of benefits:
Please help review this PR; after it lands, we'll remove all single-card runners with
By the way, the accelerate test has 3 failures; I guess they are not related to this PR's changes.
@mengfei25 please check the transformers test; no test cases are actually being executed.
Transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.
Important: be extremely careful with parallelizing the transformers tests. In the currently executing version, parallelism of the tests is essentially switched off (see the max-parallel setting, set to 1). That was done because as soon as we try to parallelize, we step into networking issues either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, and max-parallel was set to 1. The matrix is still used to break the whole test into smaller chunks of roughly 30 minutes or less each. This allows rerunning a smaller portion of the test on failure instead of rerunning the whole suite.
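The scheme described above (chunked matrix, parallelism disabled) can be sketched as a workflow fragment; the chunk names and filters here are illustrative assumptions, not the PR's actual values:

```yaml
strategy:
  # Keep chunks sequential: parallel runs triggered networking failures
  # against the Hugging Face Hub in the past.
  max-parallel: 1
  fail-fast: false
  matrix:
    # Each entry is a ~30-minute slice of the suite, so a failure only
    # requires rerunning one chunk instead of the whole run.
    test:
      - { name: chunk-1, filter: "hypothetical_filter_1" }
      - { name: chunk-2, filter: "hypothetical_filter_2" }
```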
steps:
  - name: Cleanup workspace
    run: |
      sudo find ./ |grep -v "^\./$" |xargs sudo rm -rf
This looks weird. Basically it's a hack, since GitHub Actions is supposed to take care of that entirely. Effectively you increase fetch time for the repos, since it seems you are deleting everything each time. It looks like there is some mess with access rights. Can this be fixed?
- Our CI has a potential permission issue in WORKSPACE
- A clean WORKSPACE is important for CI correctness
- The source code is lightweight; cloning takes very little time
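A possible alternative to the `sudo find` hack quoted earlier: remove the workspace contents without sudo, which works once file ownership is correct (i.e. once the container runs as the runner's user). This is a sketch of one option, not the PR's actual fix:

```shell
# Remove everything inside the workspace directory, but keep the directory
# itself -- the same effect as `find ./ | grep -v "^\./$" | xargs rm -rf`,
# minus sudo. Assumes the runner user owns the files.
clean_workspace() {
  dir="$1"
  find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}
```

If root-owned files keep appearing, that is the symptom to fix (container user mapping) rather than something to paper over with sudo.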
See proposal on the resolution here: https://github.com/intel/torch-xpu-ops/pull/1999/files#r2330443890. We need to discuss if this can be done within this PR or later on.
Removed; if we hit permission-restricted files on the host in the future, we'll remove them manually.
These seem to be new failures. I see some test failures in this weekly run as well, while the weekly run from a week ago was clean.
@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the test. I am impressed by the reduction in transformers test execution time. I was afraid that we would again step into the networking issue, but the test results indicate that we did not.
However, I have 2 requests before merging:
- There are still a few unanswered questions from me in this PR. Can you please go through them and reply?
- I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the test a few times to verify that we don't step into the problem. We can do that once we finish the discussion on the questions not yet closed.
runs-on: ${{ needs.prepare.outputs.runner_id }}
needs: prepare
container:
  image: mengfeili/intel-pvc-driver:1146-1136
Is using an image under a user account (mengfeili) per design, or is it left over from debugging?
There is no clean public image with the driver only, so I'm using my image temporarily; I'll consider pushing such an image to an official hub later.
@dvrogozh Commented on all your questions, please help review.
@dvrogozh Please help review it
image: mengfeili/intel-pvc-driver:1146-1136
volumes:
  - ${{ github.workspace }}:${{ github.workspace }}
options: --device=/dev/mem --device=/dev/dri --group-add video --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=8g
Groups owning the device might differ depending on the OS. For modern Ubuntu that's video owning /dev/dri/cardN and render owning /dev/dri/renderDN. In our workflows we don't typically use /dev/dri/cardN and rely on /dev/dri/renderDN. This means that it's important to allow access to the render group in the first place, but you have video instead. I suspect that render is hiding under needs.prepare.outputs.render_id. Is that right?
I think this part of the CI setup needs to be polished. I suggest the following:
- Pass render and video explicitly here, i.e. options: --group-add render --group-add video ...
- -u user:group is special. It's the key to managing permissions and is the answer to the question above (why you need sudo rm -rf to clean the workspace). To fix that, I suggest passing a user:group matching the credentials of the user the runner runs under. This should fix the file permission issues. On top of that, you pass the additional groups needed to access devices explicitly.
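The suggestion above can be sketched as follows. This is a hypothetical construction of the container options, assuming the group names `render` and `video` exist on the host; the exact option set (`--device`, `--shm-size`) is copied from the quoted workflow and may need adjusting:

```shell
# Run the container as the runner's own uid:gid so files created inside the
# workspace are owned by the runner user (no sudo needed for cleanup), and
# add the device-owning groups explicitly.
user_opt="-u $(id -u):$(id -g)"
group_opts="--group-add render --group-add video"
docker_options="$user_opt $group_opts --device=/dev/dri --shm-size=8g"
echo "$docker_options"
```

With this, `/dev/dri/renderDN` is reachable via the `render` group rather than relying on `video` alone.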
- Good point, fixed
- Already with the runner user; removed the rm before checkout.
I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden which require reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.
cd transformers
python3 -m pytest -rsf --make-reports=$TEST_CASE --junit-xml=reports/$TEST_CASE.xml \
  -k "${{ matrix.test.filter }}" ${{ matrix.test.cmd }} || true
xpu_list=($(echo ${ZE_AFFINITY_MASK} |sed 's/,/ /g'))
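The last quoted line splits ZE_AFFINITY_MASK (a comma-separated XPU device list) into a bash array. A standalone sketch with an assumed example value:

```shell
# ZE_AFFINITY_MASK lists the XPU devices visible to the job, e.g. "0,1".
# The real workflow inherits it from the runner; "0,1" here is an example.
ZE_AFFINITY_MASK="0,1"
# Replace commas with spaces so the result word-splits into a bash array.
xpu_list=($(echo "${ZE_AFFINITY_MASK}" | sed 's/,/ /g'))
echo "${#xpu_list[@]} devices: ${xpu_list[*]}"
```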
These changes did not exist just yesterday. Why do we need them? I thought we had agreed that the previous version was almost ready to go. I definitely have concerns with this new version. For example, the output table became much bigger and more complex: https://github.com/intel/torch-xpu-ops/actions/runs/17613758986?pr=1999. We don't need all these categories...
if(gpu==1 && $0~/Platform/){gpu=0}; if(gpu==1){print $0}; if($0~/Platform.*Graphics/){gpu=1}
}' |wc -l)"
cpus_per_xpu="$(echo |awk -v c="${cpu_num}" -v x="${xpu_num}" '{printf c/x}')"
# available gpu card number
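The cpus_per_xpu line above divides the CPU count evenly across the detected XPUs via awk. A standalone sketch with assumed example counts:

```shell
# Example counts; the real workflow derives these from nproc and the
# "Platform ... Graphics" parsing shown above.
cpu_num=16
xpu_num=4
# awk does floating-point division, unlike shell integer arithmetic.
cpus_per_xpu="$(echo | awk -v c="${cpu_num}" -v x="${xpu_num}" '{printf c/x}')"
echo "${cpus_per_xpu}"
```

Note that using awk (rather than `$((cpu_num / xpu_num))`) keeps fractional results if the counts don't divide evenly.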
These changes were not here yesterday. Are they absolutely necessary? I also see that a few other workloads besides accelerate and transformers got changed. Is that necessary as well?
test accelerate and transformers only
disable_build
disable_ut
disable_distributed
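The disable_build / disable_ut / disable_distributed flags above read like workflow toggles. One hypothetical way to express them, assuming they are workflow_dispatch inputs (the surrounding structure is a guess, only the flag names come from the PR):

```yaml
on:
  workflow_dispatch:
    inputs:
      disable_build:
        type: boolean
        default: false
      disable_ut:
        type: boolean
        default: false
      disable_distributed:
        type: boolean
        default: false
```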