Optimized SD3 pipeline #1682

Open
wants to merge 3 commits into main
Conversation

@deepak-gowda-narayana (Contributor) commented Jan 8, 2025

What does this PR do?

  • HPU graphs enabled
  • Batching for inference enabled
  • Fused SDPA integrated
  • FP8 quantization enabled
  • Updated the run command in the README and added documentation for running FP8 (see the sketch after this list)
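
For orientation, here is a minimal Python sketch of how such an optimized SD3 pipeline is typically driven through optimum-habana. The model ID, the batch_size call argument, and the prompt are assumptions added for illustration; the run command documented in this PR's README is authoritative.

```python
import torch
from optimum.habana.diffusers import GaudiStableDiffusion3Pipeline

# Assumed model ID; the README updated by this PR documents the exact invocation.
pipe = GaudiStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
    use_habana=True,       # run on HPU
    use_hpu_graphs=True,   # HPU graphs, one of the optimizations listed above
    gaudi_config="Habana/stable-diffusion",
)

# Batched inference: generate num_images_per_prompt images, batch_size samples at a time.
outputs = pipe(
    prompt="An astronaut riding a green horse",
    num_images_per_prompt=10,
    batch_size=4,
    num_inference_steps=28,
)
images = outputs.images
```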

Performance comparison of the SD3 pipeline before and after optimization

Achieved a ~4x throughput improvement with HPU graphs and fused SDPA:

| Batch Size | No. of Images | Batches | Pre-Optimization Throughput (samples/sec) | Optimized Throughput (samples/sec) |
|---|---|---|---|---|
| 1 | 10 | 10 | 0.047 | 0.227 |
| 2 | 20 | 10 | 0.051 | 0.228 |
| 4 | 40 | 10 | 0.052 | 0.228 |
| 8 | 80 | 10 | 0.052 | 0.227 |

Diffusers CI Tests Pass

================================================================================= test session starts ==================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /mnt/deepak_oh_new
configfile: setup.cfg
collected 165 items                                                                                                                                                                    

tests/test_diffusers.py .......sssss..........s........s.....s..................s.ss.sssssssssssss.............sss.......s....s.....ss.s.ssssss......s...s......s.sss........s.. [ 92%]
s.s....s.s...        
============================================================== 116 passed, 49 skipped, 280 warnings in 1495.78s (0:24:55) ==============================================================

This PR is jointly co-authored with:
Co-authored-by: Daniel Socek <[email protected]>

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
@deepak-gowda-narayana changed the title from "Optimized SD3 pipeline:" to "Optimized SD3 pipeline" on Jan 8, 2025
@deepak-gowda-narayana (Contributor, Author) commented:

@libinta @imangohari1 @regisss Requesting review of this PR.

@dsocek (Contributor) commented Jan 8, 2025

Additional performance results of this PR:

| Device | Docker | Mode | Num Inference Steps | Batch Size | Sec/Batch Step | Steps/sec | Samples/sec | Time for 1 Image (sec) |
|---|---|---|---|---|---|---|---|---|
| Gaudi3 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 1 | 0.09 | 10.64 | 0.380 | 2.63 |
| Gaudi2 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 1 | 0.15 | 6.48 | 0.231 | 4.32 |
| Gaudi2 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 10 | 1.53 | 6.54 | 0.234 | 4.28 |

@sywangyi (Collaborator) commented:

@deepak-gowda-narayana
I see that this PR removes the following logic:
with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast):
Removing it will lead to a crash when "--bf16" is not on the command line on the Gaudi2D platform, which only has BF16 GEMM. The error looks like:
error like
[INFO|pipeline_stable_diffusion_3.py:569] 2025-01-10 05:26:37,344 >> 1 prompt(s) received, 10 generation(s) per prompt, 1 sample(s) per batch, 10 total batch(es).
Traceback (most recent call last):
  File "/root/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 701, in <module>
    main()
  File "/root/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 664, in main
    outputs = pipeline(prompt=args.prompts, **kwargs_call)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/optimum-habana/optimum/habana/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py", line 610, in __call__
    ht.hpu.synchronize()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 154, in synchronize
    return _hpu_C.synchronize_device()
RuntimeError: synNodeCreateWithId failed for node: batch_gemm with synStatus 26 [Generic failure].

@heyuanliu-intel
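
For readers following the discussion, here is a minimal sketch of the guard being debated, written as a standalone helper; the gaudi_config.use_torch_autocast flag is the one quoted above, and the helper name is hypothetical rather than code from this PR.

```python
import torch

def run_with_optional_autocast(fn, gaudi_config, *args, **kwargs):
    # Sketch of the removed guard: when enabled, matmuls inside the block run in BF16,
    # which matters on devices (e.g. Gaudi2D) whose batch_gemm kernel has no FP32 path.
    with torch.autocast(
        device_type="hpu",
        dtype=torch.bfloat16,
        enabled=gaudi_config.use_torch_autocast,
    ):
        return fn(*args, **kwargs)
```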

@sywangyi (Collaborator) commented:

[error][tid:C62] FP32 operations are not supported on this device. Node Name BatchGemm135

@dsocek (Contributor) commented Jan 10, 2025

> I see that this PR removes the with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast): logic; removing it will lead to a crash when "--bf16" is not on the command line on the Gaudi2D platform, which only has BF16 GEMM. [...]

@sywangyi Thank you for pointing this out. We removed the with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast): context to allow users to control autocasting themselves (and in addition we saw a slight performance boost). Omitting --bf16 works without issues on regular Gaudi2/Gaudi3. Rather than putting a blanket BF16 restriction on all ops, a better strategy might be to add a note saying that on more limited Gaudi devices (e.g. Gaudi2D) --bf16 should be used, since some ops do not support FP32. This way we still allow users to do more fine-grained optimizations with custom op lists (some ops in BF16 and some in FP32).

@sywangyi (Collaborator) commented:

> We removed the with torch.autocast(...) context to allow users to control autocasting themselves [...] This way we still allow users to do more fine-grained optimizations with custom op lists (some ops in BF16 and some in FP32).

Do you mean the user could run something like "PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16.txt python3 xxxx" even if torch.autocast is not there?
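
For context, a minimal sketch of that runtime-variable route in Python; the PT_HPU_AUTOCAST_FP32_OPS_LIST variable and the file names are assumptions added for illustration, and the variables must be set before the Habana frameworks are imported.

```python
import os

# Point the Habana autocast machinery at custom op-list files (placeholder file names).
os.environ["PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST"] = "ops_bf16.txt"  # ops to run in BF16
os.environ["PT_HPU_AUTOCAST_FP32_OPS_LIST"] = "ops_fp32.txt"             # ops to keep in FP32 (assumed variable name)

import habana_frameworks.torch as ht  # imported after the env vars so they take effect
```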

@dsocek (Contributor) commented Jan 13, 2025

> Do you mean the user could run something like "PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16.txt python3 xxxx" even if torch.autocast is not there?

@sywangyi Thanks. Yes, that is one way. Alternatively, one can add explicit op lists directly via a config file like this one: https://huggingface.co/Habana/stable-diffusion-2/blob/main/gaudi_config.json, which is then handled in the base Gaudi diffusers pipeline class: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/diffusers/pipelines/pipeline_utils.py#L161
I am not clear whether we still need with torch.autocast(enabled=True) in the specific pipeline for the lists to actually take effect, or whether this is handled internally by the graph compiler (via the runtime-variable lists).

We did some testing with the Flux pipeline and there we saw different ops being cast internally with with torch.autocast(..) vs. without it. Better performance was observed without with torch.autocast(..).

Maybe to be safe we could add the context back and set enabled to the config value, but we should not force dtype=torch.bfloat16 unless --bf16 is used. Will do some additional testing and update.
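
To make the config-file route concrete, here is a hedged sketch of building an equivalent config in Python; the GaudiConfig field names (use_torch_autocast, autocast_bf16_ops, autocast_fp32_ops) follow the pattern referenced above, and the op names are placeholders rather than the contents of the linked gaudi_config.json.

```python
from optimum.habana import GaudiConfig

# Illustrative only: explicit autocast op lists, analogous to the linked gaudi_config.json.
gaudi_config = GaudiConfig(
    use_torch_autocast=True,
    autocast_bf16_ops=["linear", "matmul"],        # placeholder op names to run in BF16
    autocast_fp32_ops=["softmax", "layer_norm"],   # placeholder op names to keep in FP32
)
# The resulting config can be passed to the Gaudi pipeline in place of a Hub config name.
```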

@deepak-gowda-narayana (Contributor, Author) commented:

@sywangyi Daniel has made updates to the autocasting in SD3. Please re-review the PR.

@skavulya (Contributor) left a comment:

LGTM
