Optimized SD3 pipeline #1682

Open
wants to merge 3 commits into main
Conversation

@deepak-gowda-narayana (Contributor) commented Jan 8, 2025

What does this PR do?

  • HPU graphs enabled
  • Batching for inference enabled
  • Fused SDPA integrated
  • FP8 quantization enabled
  • Updated the run command in the README and added documentation for running FP8 (see the sketch after this list)
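
For orientation, here is a minimal Python sketch of how such an optimized SD3 pipeline is typically driven through optimum-habana. The model ID, the batch_size call argument, and the prompt are assumptions added for illustration; the run command documented in this PR's README is authoritative.

```python
import torch
from optimum.habana.diffusers import GaudiStableDiffusion3Pipeline

# Assumed model ID; the README updated by this PR documents the exact invocation.
pipe = GaudiStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
    use_habana=True,       # run on HPU
    use_hpu_graphs=True,   # HPU graphs, one of the optimizations listed above
    gaudi_config="Habana/stable-diffusion",
)

# Batched inference: generate num_images_per_prompt images, batch_size samples at a time.
outputs = pipe(
    prompt="An astronaut riding a green horse",
    num_images_per_prompt=10,
    batch_size=4,
    num_inference_steps=28,
)
images = outputs.images
```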

Performance comparison of the SD3 pipeline before and after optimization

Achieved a ~4x throughput improvement with HPU graphs and fused SDPA:

| Batch Size | No. of Images | Batches | Pre-Optimization Throughput (samples/sec) | Optimized Throughput (samples/sec) |
|---|---|---|---|---|
| 1 | 10 | 10 | 0.047 | 0.227 |
| 2 | 20 | 10 | 0.051 | 0.228 |
| 4 | 40 | 10 | 0.052 | 0.228 |
| 8 | 80 | 10 | 0.052 | 0.227 |

Diffusers CI Tests Pass

================================================================================= test session starts ==================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /mnt/deepak_oh_new
configfile: setup.cfg
collected 165 items                                                                                                                                                                    

tests/test_diffusers.py .......sssss..........s........s.....s..................s.ss.sssssssssssss.............sss.......s....s.....ss.s.ssssss......s...s......s.sss........s.. [ 92%]
s.s....s.s...        
============================================================== 116 passed, 49 skipped, 280 warnings in 1495.78s (0:24:55) ==============================================================

This PR is jointly co-authored with:
Co-authored-by: Daniel Socek <[email protected]>

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
@deepak-gowda-narayana changed the title from "Optimized SD3 pipeline:" to "Optimized SD3 pipeline" on Jan 8, 2025
@deepak-gowda-narayana (Contributor, Author) commented:

@libinta @imangohari1 @regisss Requesting review of this PR.

@dsocek (Contributor) commented Jan 8, 2025

Additional performance results of this PR:

| Device | Docker | Mode | Num Inference Steps | Batch Size | Sec/Batch Step | Steps/sec | Samples/sec | Time for 1 Image (sec) |
|---|---|---|---|---|---|---|---|---|
| Gaudi3 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 1 | 0.09 | 10.64 | 0.380 | 2.63 |
| Gaudi2 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 1 | 0.15 | 6.48 | 0.231 | 4.32 |
| Gaudi2 | 1.19.0 | Lazy w/ HPU Graphs | 28 | 10 | 1.53 | 6.54 | 0.234 | 4.28 |

@sywangyi (Collaborator) commented:

@deepak-gowda-narayana
I see that this PR removes the following logic:
with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast):
Removing it will lead to a crash when "--bf16" is not on the command line on the Gaudi2D platform, which only has BF16 GEMM. The error looks like:
error like
[INFO|pipeline_stable_diffusion_3.py:569] 2025-01-10 05:26:37,344 >> 1 prompt(s) received, 10 generation(s) per prompt, 1 sample(s) per batch, 10 total batch(es).
Traceback (most recent call last):
  File "/root/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 701, in <module>
    main()
  File "/root/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 664, in main
    outputs = pipeline(prompt=args.prompts, **kwargs_call)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/optimum-habana/optimum/habana/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py", line 610, in __call__
    ht.hpu.synchronize()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 154, in synchronize
    return _hpu_C.synchronize_device()
RuntimeError: synNodeCreateWithId failed for node: batch_gemm with synStatus 26 [Generic failure].

@heyuanliu-intel
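
For readers following the discussion, here is a minimal sketch of the guard being debated, written as a standalone helper; the gaudi_config.use_torch_autocast flag is the one quoted above, and the helper name is hypothetical rather than code from this PR.

```python
import torch

def run_with_optional_autocast(fn, gaudi_config, *args, **kwargs):
    # Sketch of the removed guard: when enabled, matmuls inside the block run in BF16,
    # which matters on devices (e.g. Gaudi2D) whose batch_gemm kernel has no FP32 path.
    with torch.autocast(
        device_type="hpu",
        dtype=torch.bfloat16,
        enabled=gaudi_config.use_torch_autocast,
    ):
        return fn(*args, **kwargs)
```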

@sywangyi (Collaborator) commented:

[error][tid:C62] FP32 operations are not supported on this device. Node Name BatchGemm135

@dsocek (Contributor) commented Jan 10, 2025

> I see that this PR removes the with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast): logic; removing it will lead to a crash when "--bf16" is not on the command line on the Gaudi2D platform, which only has BF16 GEMM. [...]

@sywangyi Thank you for pointing this out. We removed the with torch.autocast(device_type="hpu", dtype=torch.bfloat16, enabled=self.gaudi_config.use_torch_autocast): context to allow users to control autocasting themselves (and in addition we saw a slight performance boost). Omitting --bf16 works without issues on regular Gaudi2/Gaudi3. Rather than putting a blanket BF16 restriction on all ops, a better strategy might be to add a note saying that on more limited Gaudi devices (e.g. Gaudi2D) --bf16 should be used, since some ops do not support FP32. This way we still allow users to do more fine-grained optimizations with custom op lists (some ops in BF16 and some in FP32).

@sywangyi (Collaborator) commented:

> We removed the with torch.autocast(...) context to allow users to control autocasting themselves [...] This way we still allow users to do more fine-grained optimizations with custom op lists (some ops in BF16 and some in FP32).

Do you mean the user could run something like "PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16.txt python3 xxxx" even if torch.autocast is not there?
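
For context, a minimal sketch of that runtime-variable route in Python; the PT_HPU_AUTOCAST_FP32_OPS_LIST variable and the file names are assumptions added for illustration, and the variables must be set before the Habana frameworks are imported.

```python
import os

# Point the Habana autocast machinery at custom op-list files (placeholder file names).
os.environ["PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST"] = "ops_bf16.txt"  # ops to run in BF16
os.environ["PT_HPU_AUTOCAST_FP32_OPS_LIST"] = "ops_fp32.txt"             # ops to keep in FP32 (assumed variable name)

import habana_frameworks.torch as ht  # imported after the env vars so they take effect
```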

@dsocek (Contributor) commented Jan 13, 2025

> Do you mean the user could run something like "PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16.txt python3 xxxx" even if torch.autocast is not there?

@sywangyi Thanks. Yes, that is one way. Alternatively, one can add explicit op lists directly via a config file like this one: https://huggingface.co/Habana/stable-diffusion-2/blob/main/gaudi_config.json, which is then handled in the base Gaudi diffusers pipeline class: https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/diffusers/pipelines/pipeline_utils.py#L161
I am not clear whether we still need with torch.autocast(enabled=True) in the specific pipeline for the lists to actually take effect, or whether this is handled internally by the graph compiler (via the runtime-variable lists).

We did some testing with the Flux pipeline and there we saw different ops being cast internally with with torch.autocast(..) vs. without it. Better performance was observed without with torch.autocast(..).

Maybe to be safe we could add the context back and set enabled to the config value, but we should not force dtype=torch.bfloat16 unless --bf16 is used. Will do some additional testing and update.
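
To make the config-file route concrete, here is a hedged sketch of building an equivalent config in Python; the GaudiConfig field names (use_torch_autocast, autocast_bf16_ops, autocast_fp32_ops) follow the pattern referenced above, and the op names are placeholders rather than the contents of the linked gaudi_config.json.

```python
from optimum.habana import GaudiConfig

# Illustrative only: explicit autocast op lists, analogous to the linked gaudi_config.json.
gaudi_config = GaudiConfig(
    use_torch_autocast=True,
    autocast_bf16_ops=["linear", "matmul"],        # placeholder op names to run in BF16
    autocast_fp32_ops=["softmax", "layer_norm"],   # placeholder op names to keep in FP32
)
# The resulting config can be passed to the Gaudi pipeline in place of a Hub config name.
```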

@deepak-gowda-narayana (Contributor, Author) commented:

@sywangyi Daniel has made updates to the autocasting in SD3. Please re-review the PR.

@skavulya (Contributor) left a comment:

LGTM
