Issue running 'PhoenixOS-Remoting' separately, with stable-diffusion (pytorch version) as AI app #17

Open
zaglc opened this issue Jan 4, 2025 · 11 comments

Comments

@zaglc

zaglc commented Jan 4, 2025

1 Problem description

I ran into CUDA error 209 when running the stable-diffusion app (args: batch=2, iter=2, inference.py): the program cannot find a kernel image, possibly because external .so files are not supported. I hit the same error with the original cricket, even though your paper "Characterizing Network Requirements for GPU API Remoting in AI Applications" reports running SD with PyTorch.
[Image: client-side error, PhoenixOS built without optimization]
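
CUDA error 209 is cudaErrorNoKernelImageForDevice: none of the loaded binaries contains code that can run on the current GPU. One sanity check that is independent of the remoting layer is to compare the architectures the PyTorch build was compiled for with the device's compute capability. The sketch below is only an illustration under stated assumptions: the helper names `has_kernel_image` and `check_local_torch` are hypothetical, and the logic applies the general CUDA compatibility rules (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version; `compute_XY` PTX can be JIT-compiled for any equal or newer capability). It cannot detect a failure caused by the remoting client not forwarding fatbins to the server.

```python
from typing import Iterable, Tuple


def has_kernel_image(arch_list: Iterable[str], capability: Tuple[int, int]) -> bool:
    """Return True if any entry in arch_list can run on a device of `capability`."""
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit() or len(digits) < 2:
            continue  # skip entries this sketch does not understand
        # The last digit is the minor version, e.g. "86" -> (8, 6).
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        # Native code: same major version, device minor at least as new.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


def check_local_torch() -> bool:
    """Run the check against the local PyTorch install (needs a visible GPU)."""
    import torch  # imported lazily so has_kernel_image works without PyTorch

    return has_kernel_image(torch.cuda.get_arch_list(),
                            torch.cuda.get_device_capability(0))
```

If `check_local_torch()` returns True when run directly on the GPU host but error 209 still appears through the remoting client, the PyTorch build itself is probably not the cause.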

Here are my build arguments. VERSION=NO_OPTIMIZATION and the version with async, cache, and handler all enabled produce the same error.

LOG=INFO VERSION=NO_OPTIMIZATION make

2 Environment setup

My machine runs Ubuntu 22.04 with one NVIDIA A4500 (sm=80), driver version 535.183.06, and CUDA 11.8. I tried several approaches; none of them succeeded:

  • Compiling directly on the physical machine: an error is raised when compiling cuda-gdb 11.1 (a config.h is not found).
  • Using the cuda-gdb 11.8 source RPM: many path issues due to version differences (e.g. build/gnu -> build-gnu), and I could not track down every wrong path.
  • Using the Dockerfile provided by this repo (CUDA 11.1): I hit the same issue reported for the original cricket, an NVML driver/library version mismatch, and I can only download NVML>=565.

Then I pulled the NVIDIA Docker image nvidia/cuda:11.1.1-cudnn8-devel-rockylinux8 and tried to build the environment on top of it, following your Dockerfile. The result was the same problem as in section 1.

I ran the PyTorch SD app under Miniconda. Here is my environment:

python             3.8.0
accelerate         0.20.1
certifi            2024.12.14
charset-normalizer 3.4.1
diffusers          0.9.0
filelock           3.16.1
fsspec             2024.12.0
huggingface-hub    0.24.6
idna               3.10
importlib_metadata 8.5.0
numpy              1.24.4
packaging          24.2
pillow             10.4.0
pip                24.2
psutil             6.1.1
PyYAML             6.0.2
regex              2024.11.6
requests           2.32.3
safetensors        0.5.0
sentencepiece      0.2.0
setuptools         75.1.0
tokenizers         0.13.3
torch              1.10.1+cu111
torchaudio         0.10.1+cu111
torchvision        0.11.2+cu111
tqdm               4.67.1
transformers       4.30.0
typing_extensions  4.12.2
urllib3            2.2.3
wheel              0.44.0
zipp               3.20.2
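
For anyone trying to reproduce this setup, a listing like the one above can be checked against another environment mechanically. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes bare "name version" lines as shown above. Note that the `python` entry is the interpreter rather than a pip package, so it would be reported as missing by `installed_versions`.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_pins(listing: str) -> Dict[str, str]:
    """Parse 'name version' lines into {lowercased name: version}."""
    pins = {}
    for line in listing.splitlines():
        parts = line.split()
        # Keep only lines that look like "<name> <version starting with a digit>".
        if len(parts) == 2 and parts[1][0].isdigit():
            pins[parts[0].lower()] = parts[1]
    return pins


def diff_versions(pins: Dict[str, str],
                  installed: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, found)} per mismatch; found is None if missing."""
    return {name: (want, installed.get(name))
            for name, want in pins.items()
            if installed.get(name) != want}


def installed_versions(names: Iterable[str]) -> Dict[str, str]:
    """Look up installed versions with importlib.metadata (Python 3.8+)."""
    from importlib import metadata

    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # leave missing packages out; diff_versions reports them
    return found
```

A typical use would be `diff_versions(pins, installed_versions(pins))`, where `pins` comes from `parse_pins` applied to the listing; an empty result means every listed package matches.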

3 Other modification

Because of the dependency on the main PhoenixOS repo, I disabled the following code in cpu/proxy/svc.cpp, which is not guarded by POS_ENABLE:
[Image: svc]
I also manually disabled compilation of tests and bin/tests in the main Makefile.

I sincerely hope you can point out any steps I missed, tell me what additional traceback information I can provide, or share a working configuration, Dockerfile, or Docker image.

Thanks a lot

@913887524gsd

Which commit are you on right now? I communicated with the authors earlier, and they urgently pushed a temporary version, so the version history became a total mess... The remoting framework is not compatible with PhOS at the moment, so you may need to roll back some commits.

@zaglc
Author

zaglc commented Jan 6, 2025

Thanks for your reply. The PhoenixOS-Remoting commit ID is 821b3b713590427236eca2a8551aad95dae30550.

Moreover, I noticed in #9 that you have uploaded some Docker images, so I used phoenixos/pytorch:11.3-ubuntu20.04, and compilation now succeeds. Since your remoting repo appears to depend strongly on PhoenixOS, I tried to build and run PhoenixOS in phoenixos/pytorch:11.3-ubuntu20.04.

However, when I built PhoenixOS following your documentation in examples/resnet, it raised the following errors.
Client side (bus error):

process id: 42030
+00:00:02.052061 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.060230 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.060539 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.060718 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.060836 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.061186 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.061481 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
+00:00:02.061683 WARNING: could not find .nv.info section. This means this binary does not contain any kernels.	in cpu-elf2.c:922
Bus error (core dumped)
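
The client log above repeats one warning several times before the crash. For longer logs, grouping identical messages makes it easier to see which ones dominate. The sketch below is a hedged helper, not part of cricket or PhOS: the regular expression is inferred only from the log lines shown in this thread (`+timestamp LEVEL: message<TAB>in file:line`), and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Inferred from the lines in this thread; the trailing "in file:line" is optional.
LINE_RE = re.compile(
    r"^\+(?P<ts>\d+:\d+:\d+\.\d+)\s+(?P<level>[A-Z]+):\s*(?P<msg>.*?)"
    r"(?:\s+in\s+(?P<src>\S+:\d+))?$"
)


def summarize(log_text: str) -> Counter:
    """Count log lines grouped by (level, message, source location)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines such as "Bus error (core dumped)" are ignored
            counts[(match["level"], match["msg"], match["src"])] += 1
    return counts
```

Calling `summarize(text).most_common()` lists the most frequent entries first.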

server side:

 POS Log  >>>>>>>>>> PhOS Workspace <<<<<<<<<<
 _____  _                      _       ____   _____
|  __ \| |                    (_)     / __ \ / ____|
| |__) | |__   ___   ___ _ __  ___  _| |  | | (___
|  ___/| '_ \ / _ \ / _ \ '_ \| \ \/ / |  | |\___ \
| |    | | | | (_) |  __/ | | | |>  <| |__| |____) |
|_|    |_| |_|\___/ \___|_| |_|_/_/\_\\____/|_____/

 POS Log  PhoenixOS workspace created, welcome!
+00:00:00.044937 INFO:	waiting for RPC requests...
Cache Optimization: Enabled!
Async Optimization: Enabled!
Handler Optimization: Enabled!
xpu remote address: localhost
create shm buffer
 POS Log  [POSParser @ 0x7f9c0c001040]  parser started
 POS Log  [POSWorker @ 0x7f9c0c001df0]  worker started
 POS Log  reserved virtual memory space: device_id(0), base(0x7facd0000000), size(18306039808)
+00:00:05.114455 INFO:	string: "hello"

 POS Log  [POSParser @ 0x7f9c0c0063e0]  parser started
 POS Log  [POSWorker @ 0x7f9c0c007160]  worker started
 POS Log  [HandleManager Metrics] cuda_memory:
[Ticker Metric Report] Restore State:
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms

 POS Log  [HandleManager Metrics] cuda_module:
[Ticker Metric Report] Restore State:
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms

[Reducer Metric Report] # Restored Functions:
  max: N/A
  min: N/A
  sum: N/A
  avg: N/A

 POS Log  finish dump kernel metadata to /var/log/phos/daemon/resnet152_kernel_metas.txt
 POS Log  [POSParser @ 0x7f9c0c0063e0]  parser daemon thread shutdown
 POS Log  [Parser Metrics]:
[Reducer Metric Report] KERNEL_out_memories:
  max: N/A
  min: N/A
  sum: N/A
  avg: N/A
[Reducer Metric Report] KERNEL_in_memories:
  max: N/A
  min: N/A
  sum: N/A
  avg: N/A

[Reducer Metric Report] KERNEL_number_of_vendor_kernels: N/A
[Reducer Metric Report] KERNEL_number_of_user_kernels: N/A

 POS Log  [POSWorker @ 0x7f9c0c007160]  worker daemon thread shutdown
 POS Log  [Worker Metrics]:
[Ticker Metric Report] On-demand Reload State (by Worker Thread):
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms
[Ticker Metric Report] On-demand Reload (by Worker Thread):
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms
[Ticker Metric Report] Unexecuted APIs:
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms
[Ticker Metric Report] Recomputation APIs:
  max: N/A ms
  min: N/A ms
  avg: N/A ms
  sum: N/A ms
  p10: N/A ms
  p50: N/A ms
  p99: N/A ms

[Reducer Metric Report] # On-demand Reload Handles with State (by Worker Thread): N/A
[Reducer Metric Report] # On-demand Reload Handles (by Worker Thread): N/A

[Reducer Metric Report] On-demand Reload Bytes (by Worker Thread):
  max: N/A
  min: N/A
  sum: N/A
  avg: N/A

[Sequence Metric Report] # On-demand Restore Handles: 
    timeline: N/A
    values:   N/A
[Sequence Metric Report] # On-demand Restore Handles (with State): 
    timeline: N/A
    values:   N/A
[Sequence Metric Report] On-demand Restore State Size (byte): 
    timeline: N/A
    values:   N/A
[Sequence Metric Report] On-demand Restore Duration (ms): 
    timeline: N/A
    values:   N/A
[Sequence Metric Report] On-demand Restore State Duration (ms): 
    timeline: N/A
    values:   N/A

When I interrupted the daemon process and tried to re-run it, it threw the following error message. I was running the proxy process inside the container; does that matter?
Server side:

 POS Warn  [POSOobClient @ 0x5632d7b93250]  failed to bind out-of-band UDP client to "0.0.0.0:10086": Address already in use, try 1th time to switch port to 10087
file:       ../pos/include/oob.h;
function:   POSOobServer::POSOobServer(POSWorkspace*, std::map<pos_oob_msg_typeid_t, unsigned char (*)(int, sockaddr_in*, POSOobMsg*, POSWorkspace*, POSOobServer*)>, const char*, uint16_t);
line:       246;
 POS Warn  [POSOobServer @ 0x559d97288790]  failed to obtain socket address for the main session: Address already in use
 POS Error  [POSOobServer @ 0x559d97288790]  failed to create OOB main session
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)
 POS Warn  failed execution of command cricket-rpc-server 2>&1: exit_code(134)
 POS Warn  phosd start failed
 POS Error  CLI executed failed
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)
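
The "Address already in use" messages in the log above are consistent with a process from the interrupted run still holding the out-of-band UDP port, although that is an assumption rather than something the log proves. A quick way to test it is to try binding the port yourself. The sketch below uses only the Python standard library; the helper name `udp_port_is_free` is hypothetical, and it must be run in the same network namespace as the daemon (inside the same container, unless host networking is used).

```python
import socket


def udp_port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a UDP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already owns the port.
            return False
    return True
```

For example, `udp_port_is_free(10086)` checks the port named in the first warning. If it returns False, some process still owns the port and would need to be stopped before the daemon can restart.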

@913887524gsd

That's not me, I'm not the author... Btw, I meant the commit of PhoenixOS. It's a2e019e3b6b8210058c5ba7e679a8665676686a3 now, but 2f4122fc5ea5d59a4ee844b3441b80aeddfeac8a is the compatible one.

@zaglc
Author

zaglc commented Jan 6, 2025

Sorry, I misunderstood. I will try this commit later.

@zaglc
Author

zaglc commented Jan 20, 2025

Commit 2f4122fc5ea5d59a4ee844b3441b80aeddfeac8a still raises errors on my machine. The server side prints the same N/A log I showed before, while the client side seems to hit a fatbin-init error:

root@5f6787e24b0c:~/examples/resnet# env $phos python3 ./train.py
+00:00:00.226614 ERROR: error registering fatbin: 6	in cpu-client.c:479
+00:00:00.227711 ERROR: error registering fatbin: 32652	in cpu-client.c:479
+00:00:00.227722 ERROR: error registering function: 298883840	in cpu-client.c:438
+00:00:00.227730 WARNING: fatCubinHandle is NULL - so we have nothing to unload. (This is okay if this binary does not contain a kernel.)	in cpu-client.c:498
+00:00:00.251798 INFO:	api-call-cnt: 0
+00:00:00.251822 INFO:	memcpy-cnt: 0
----client_total_infos----
api 1: count 1, client_total_time 90775.000000
api 50: count 3, client_total_time 15849.333333
api 2: count 1, client_total_time 17902.000000
api 53: count 1, client_total_time 675.000000

I first built the environment under commit a2e019e3b6b8210058c5ba7e679a8665676686a3 and then checked out 2f4122fc5ea5d59a4ee844b3441b80aeddfeac8a. Does that matter? @913887524gsd, could you please share your detailed steps beyond their documentation, if you modified anything? @zobinHuang, could you please share some insights for debugging?
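
Building under one commit and then checking out another can leave stale build artifacts behind, so it may help to confirm which commit the tree is actually on and then rebuild from a clean state. The sketch below covers only the first part and is an illustration, not the project's procedure: the helper names are hypothetical, and it relies only on `git rev-parse HEAD`.

```python
import subprocess


def current_head(repo_dir: str) -> str:
    """Return the full hash of HEAD in repo_dir (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def is_commit(head: str, expected: str) -> bool:
    """True if `expected` (full or abbreviated hash, 7+ chars) names `head`."""
    expected = expected.strip().lower()
    return len(expected) >= 7 and head.strip().lower().startswith(expected)
```

For example, `is_commit(current_head("."), "2f4122fc5ea5d59a4ee844b3441b80aeddfeac8a")` run from the PhoenixOS checkout tells you whether HEAD is the commit suggested earlier in this thread.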

@913887524gsd

913887524gsd commented Jan 20, 2025

Have you removed client_exist.txt?

@zaglc
Author

zaglc commented Jan 20, 2025

> Have you removed client_exist.txt?

I tried this just now, but the client side fails with Bus error (core dumped) again after printing "could not find .nv.info section.", while the server side still shows N/A.

@913887524gsd

> > Have you removed client_exist.txt?
>
> I tried this just now, but the client side fails with Bus error (core dumped) again after printing "could not find .nv.info section.", while the server side still shows N/A.

I don't know either; I have not run into this issue qwq.
It sounds like you are using sockets rather than IB for communication.
Maybe something is incompatible with your test suite?

@zaglc
Author

zaglc commented Jan 20, 2025

That's fine, thanks. Did you succeed in running train.py with IB enabled? It's an optional configuration argument in their remoting repo, but I can't find any interface for it in the current repo. Do you have any idea?

@913887524gsd

As far as I know, this is their external version of the code, and many mysterious things remain.
I dived into their source code but did not fully understand how the system works XD.
Maybe I should take more time to sort out the code, but I'm having health problems, so I have to stop for now.
Sorry, I can't give you any valuable advice qwq.

@zaglc
Author

zaglc commented Jan 20, 2025

Sorry to hear that. I hope you get well soon.
