Intermittent RPC Failed while cloning from github.com #111

Open
zxiiro opened this issue Apr 5, 2024 · 6 comments
zxiiro commented Apr 5, 2024

I'm seeing intermittent failures to clone the triton and ucx packages from github.com when running tests in ARC. I'm not sure whether it affects non-ARC runners, as I haven't done any work on that side, but these intermittent failures are fairly frequent and quite annoying: re-running the job a few times eventually gets them to resolve, but it would be good to have a solution that didn't require so many re-runs.

47.38   Running command git clone --filter=blob:none --quiet https://github.com/openai/triton /tmp/pip-req-build-cqkk4lff
47.38   error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
47.38   fatal: the remote end hung up unexpectedly
47.38   fatal: early EOF
47.38   fatal: index-pack failed
47.38   error: subprocess-exited-with-error
0.622 + git clone --recursive https://github.com/openucx/ucx.git
0.654 Cloning into 'ucx'...
35.99 error: RPC failed; curl 56 GnuTLS recv error (-9): Error decoding the received TLS packet.
35.99 fatal: the remote end hung up unexpectedly
35.99 fatal: early EOF
36.00 fatal: index-pack failed
ERROR: process "/bin/sh -c if [ -n \"${UCX_COMMIT}\" ] && [ -n \"${UCC_COMMIT}\" ]; then bash ./install_ucc.sh; fi" did not complete successfully: exit code: 128
zxiiro self-assigned this Apr 5, 2024
zxiiro commented Apr 5, 2024

I tried adding this code to install_base.sh, but it didn't seem to help; I got TLS errors instead. I suspect this may be related to a timeout, since the git clone is actually invoked by pip install, which I believe has a 15-second timeout by default.

# Set a high postbuffer to avoid git clone error for some repos.
#     error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly
git config --global http.postBuffer 1048576000
git config --global https.postBuffer 1048576000

I think adding --timeout 300 (not sure what the appropriate timeout should be) to the pip install commands might help the clone succeed more often; a rough sketch follows the reference below.

Ref: https://stackoverflow.com/questions/50305112/pip-install-timeout-issue
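A minimal sketch of what that change could look like, using the triton install from the log above (the 300-second value and the retry count are assumptions, not tested settings; also note that pip's --timeout governs pip's own network requests and may not be inherited by the git subprocess it spawns):

# Hypothetical pip invocation with a longer network timeout and extra retries;
# --timeout and --retries are standard pip install options.
pip install --timeout 300 --retries 5 --progress-bar off \
    git+https://github.com/openai/triton#subdirectory=python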

zxiiro commented Apr 8, 2024

Unfortunately the timeout change didn't seem to help. I'm still seeing the RPC failure during pip install.

jeanschmidt (Contributor) commented:

That would be very strange if it only happens on ARC, as I can't see why changing the runner would make git clone fail...

zxiiro commented Apr 18, 2024

I do feel like this is related to some kind of git timeout while cloning the repo. If you look at the timestamps, the failure always comes around 35-ish seconds after the clone starts (I'm not sure whether the 12.45 ... 47.55 values are seconds or some other unit), but it has been about that much time in every instance of the failure I've seen. Maybe setting the timeout for pip install isn't enough; a sketch of git-level alternatives follows the log below.

[17647](https://github.com/pytorch/pytorch-canary/actions/runs/8710434864/job/23934318362?pr=208#step:6:17651)
#49 12.45 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 conda run -n py_3.10 pip install --timeout 300 --progress-bar off git+https://github.com/openai/triton@989adb9a29496c22a36ef82ca69cad5dad536b9c#subdirectory=python
[17648](https://github.com/pytorch/pytorch-canary/actions/runs/8710434864/job/23934318362?pr=208#step:6:17652)
#49 47.55   Running command git clone --filter=blob:none --quiet https://github.com/openai/triton /tmp/pip-req-build-6x2m6292
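For what it's worth, a minimal sketch of git-level settings that could be experimented with, assuming the stall is on the HTTP transport (http.version, http.lowSpeedLimit, and http.lowSpeedTime are documented git config keys; whether they help here is untested in this thread, and the values are guesses):

# Force HTTP/1.1, a commonly suggested workaround for
# "RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly".
git config --global http.version HTTP/1.1
# Make stalled transfers fail explicitly: abort only if throughput stays
# below 1000 bytes/sec for 60 seconds (both options default to disabled).
git config --global http.lowSpeedLimit 1000
git config --global http.lowSpeedTime 60
# Verbose curl tracing can show where the ~35 seconds goes:
#     GIT_TRACE_CURL=1 git clone https://github.com/openucx/ucx.git

Since the clone that pip spawns runs inside the Docker build, settings like these would need to be applied in the image (e.g. next to the http.postBuffer lines above) to take effect.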

ZainRizvi added the workstream/linux-cpu (Get CPU jobs working on linux) label Apr 30, 2024
ZainRizvi added this to the ARC Runner Reliability milestone Apr 30, 2024
zxiiro commented May 15, 2024

> That would be very strange if it only happens on ARC, as I can't see why changing the runner would make git clone fail...

I suspect this doesn't only happen in ARC, but we probably see it more often in ARC because the ARC builds are not cached like the non-ARC builds: the calculate-docker action seems to run every time in ARC (which is yet another issue with ARC).

So I think this is a worthwhile problem to figure out, even for non-ARC jobs where it's lower priority, as fixing it will increase the stability of the build.

jeanschmidt (Contributor) commented:

I have this issue open as well, so there are more external-dependency problems than just this one: pytorch/pytorch#124825
