[SPIR-V] Add proposal for I/O builtin in Clang #97

Keenuts · 2024-11-08T14:37:24Z

Initial proposal to implement Input/Output built-ins using both semantics and inline SPIR-V.

Signed-off-by: Nathan Gauër <[email protected]>

s-perron

I believe that there should be a really high bar for changing the spec in a way that is not backwards compatible. For now, I don't think we should change the spec. However, I do not know the clang code very well. If others who know it better say that changing the spec is necessary, then I will go along with it.

proposals/NNNN-spirv-input-builtin.md

Keenuts · 2024-11-15T15:29:53Z

Thanks, I have published another draft PR and updated this PR to avoid changing the spec.
This should be closer to what DXIL is doing, but will change a bit how the few semantics we already have are implemented (since we now have a generic intrinsic).

New draft PR is llvm/llvm-project#116393

Keenuts · 2024-11-20T16:20:35Z

@tex3d what were your concerns about this design?

proposals/NNNN-spirv-input-builtin.md

s-perron

This looks good to me. The only question is what we will do with the address space. I'll wait until that is finalized before I approve.

s-perron · 2024-11-20T16:35:28Z

> @tex3d what were your concerns about this design?

Some of his questions were about how well it could be optimized. The two main cases are:

1. If the input variable is never used, will the call to the load intrinsic be removed?
2. If the output variable is never actually written to, could the store intrinsic be removed?

Keenuts · 2024-11-20T17:02:05Z

> > @tex3d what were your concerns about this design?
>
> Some of his questions were about how well it could be optimized. The two main cases are:
>
> 1. If the input variable is never used, will the call to the load intrinsic be removed?
> 2. If the output variable is never actually written to, could the store intrinsic be removed?

For 1 and 2, I'd assume that if we optimize everything else and just leave a load/store, the driver should be able to optimize those away.

However, from the Vulkan spec, an Output BuiltIn starts with an undefined value.
So in this design, if no value is written by the user, a default value will be written to the output in the dtor.
I'd say that's OK, since we replace an undefined behavior with another behavior.

proposals/NNNN-spirv-input-builtin.md

llvm-beanz

This all seems reasonable to me and aligns with how we're handling inputs and outputs in DirectX today.

proposals/NNNN-spirv-input-builtin.md

tex3d

I think input and output are different enough to warrant separate examples and descriptions of behavior. I'm not sure why we would be storing to an input in a destructor or loading from an output in a constructor.

This seems mainly about defining a mechanism for declaring global static variables as built-in shader inputs/outputs, rather than the wrapped entry parameter mechanism. So as a note, the example would be different wrapping a main with an output parameter, and implications for IR (pointer arg instead of value arg) and optimization opportunities are different as well. Perhaps this should only focus on the special global variable attribute, and not the standard HLSL entry parameter case.

I am a bit concerned about how we specify that variables only read from or written to under control flow preserve their intended load/store locations. Unless you are always expecting an unconditional load of the input before the shader and an unconditional store of the output after the shader, this scheme relies on an unspoken assumption that optimizations can translate these cases:

From: unconditional load of built-in before entry and store to static global var, then conditional load from static global under control flow
To: conditional load of built-in under control flow.

From: conditional store to static global var under control flow, then unconditionally load static global after entry and store to built-in.
To: conditional store to built-in under control flow.
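To make the second pair concrete, here is a minimal C++ model of the two forms. It is only a sketch: `FakeBuiltin`, `g_output`, `run_wrapped`, and `run_direct` are hypothetical names made up for illustration, not anything from the proposal or from Clang. It counts how many stores reach the built-in in each form.

```cpp
#include <cassert>

// Hypothetical stand-in for an output built-in; counts the stores it receives.
struct FakeBuiltin {
    float value = 0.0f;
    int store_count = 0;
    void store(float v) { value = v; ++store_count; }
};

// Models the attributed static global variable (assumed name).
static float g_output = 0.0f;

// User shader body: writes the global only under control flow.
void shader_body(bool cond) {
    if (cond)
        g_output = 42.0f;
}

// "From" form: the store to the built-in is unconditional, after the entry.
int run_wrapped(bool cond) {
    FakeBuiltin out;
    g_output = 0.0f;
    shader_body(cond);
    out.store(g_output);  // always executed, even if the shader never wrote
    return out.store_count;
}

// "To" form: the store to the built-in sits under the same condition.
int run_direct(bool cond) {
    FakeBuiltin out;
    if (cond)
        out.store(42.0f);
    return out.store_count;
}
```

In this model `run_wrapped(false)` still performs one store while `run_direct(false)` performs none, which is the gap an optimization (or a dedicated pass) would have to close.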

Additionally, if a built-in is never read or written, the load or store should be eliminated. Eliminating the store seems particularly problematic. This is one reason I think inputs and outputs need to be explored separately.

This proposal doesn't seem to discuss what's required for this to work properly.

I don't know of a language construct that expresses these semantics clearly without special-handling for these globals. Without a description of any required special language and compiler implementation implications, it's hard to evaluate whether the proposal can solve these issues when using a constructor/destructor paradigm.

I'm not a fan of the language semantics being significantly changed for a static global variable decl by using an attribute, but I realize that's the existing construct implemented in DXC, so I guess we have to support it somehow.

tex3d · 2024-11-26T01:39:50Z

> > > @tex3d what were your concerns about this design?
> >
> > Some of his questions were about how well it could be optimized. The two main cases are:
> >
> > 1. If the input variable is never used, will the call to the load intrinsic be removed?
> > 2. If the output variable is never actually written to, could the store intrinsic be removed?
>
> For 1 and 2, I'd assume that if we optimize everything else and just leave a load/store, the driver should be able to optimize those away.
>
> However, from the Vulkan spec, an Output BuiltIn starts with an undefined value. So in this design, if no value is written by the user, a default value will be written to the output in the dtor. I'd say that's OK, since we replace an undefined behavior with another behavior.

The never-use case is the simplest to resolve with special handling, but I think these should be guaranteed to be eliminated before SPIR-V, otherwise they could be illegal, since you're adding everything declared globally to any entry compiled from the file (you could have a built-in that isn't valid in another entry point type).

I'm still not sure how you reliably move accesses to the correct control-flow locations without a special pass or something. If you are going to use a custom pass for these anyway, why bother with the constructor/destructor mechanism in the first place, since that will just obfuscate the real access locations which would be much easier and more reliable to translate directly from loads/stores of the original global variable?

Adding a special constructor/destructor to initialize/store a global variable when an attribute is present feels like a weird hack from the language semantics perspective already.

IMHO, this attribute should be deprecated in 202x and replaced with something cleaner for 202y.

Keenuts · 2024-11-26T10:11:24Z

> I think input and output are different enough to warrant separate examples and descriptions of behavior. I'm not sure why we would be storing to an input in a destructor or loading from an output in a constructor.

Agreed, I'll modify the proposal to remove the dtor for inputs, as it is not useful.
For the ctor on outputs, see the part below about storing undef.

> I am a bit concerned about how we specify that variables only read from or written to under control flow preserve their intended load/store locations.

OK, I think I got it. The concern is: what if a lane conditionally stores to a built-in?
Since the store to the builtin is done at the end, you worry that the undefined value in the output will be changed on all paths, since the built-in store is now unconditional.

This specific issue can be solved by having a ctor on output builtins:

1. Load the undefined value into the global.
2. The shader modifies it, or not, under some control flow.
3. Store back the value, which is either the original undefined value or a new value.
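The three steps above can be sketched as a small RAII model in C++. This is an illustration under assumptions, not the proposed implementation: `FakeOutput`, `OutputGlobal`, and `run` are hypothetical names, and the "undefined" initial value is modeled as an ordinary float, which real hardware does not guarantee.

```cpp
#include <cassert>

// Hypothetical stand-in for the storage behind an output built-in.
struct FakeOutput {
    float value;
};

// Models the global: the ctor loads the built-in, the dtor stores it back.
class OutputGlobal {
public:
    explicit OutputGlobal(FakeOutput &builtin)
        : builtin_(builtin), local_(builtin.value) {}  // step 1: load
    ~OutputGlobal() { builtin_.value = local_; }       // step 3: store back
    float &get() { return local_; }

private:
    FakeOutput &builtin_;
    float local_;
};

// Runs a "shader" that writes the output only when cond is true (step 2),
// then returns what ends up in the built-in.
float run(bool cond, float initial) {
    FakeOutput builtin{initial};
    {
        OutputGlobal g(builtin);
        if (cond)
            g.get() = 7.0f;
    }  // dtor runs here and stores the global back
    return builtin.value;
}
```

When the condition is false the built-in keeps its initial value, and when it is true it receives the new one. The model assumes a load/store round trip has no side effect of its own.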

Keenuts · 2024-11-26T10:52:24Z

@tex3d I replied to your concerns and updated the spec to give more detail on the specific handling of inputs/outputs.
Let me know if I missed something!

Keenuts · 2024-11-27T13:51:47Z

@tex3d so Steven found a case which hinted at something incomplete: the PointSize BuiltIn.
Unlike the others, this one is assumed to have the value 1.0 if not stored to. But the question was: would a load/store pair be OK?

The answer is actually not that clear.
In SPIR-V, these all yield the same result:

- Load, Store
- Store 1.0
- Do nothing

All 3 options give us a point size of 1.0.
And all are translated into a reg = 1.f in the final ISA I have access to (AMD RDNA).

But what's interesting is:

- a = load(), store(a + 123)

This specific case yields a reg = NaN, which makes me believe AMD has a special case for store(load()), but otherwise a load yields an undefined value, hence the NaN.

Additional weird detail:

printf("%f, %x", point_size, asuint(point_size)); // Prints "0, 0".
point_size = point_size + 10.f; // Yields point_size = NaN.

So this hints that the actual store could be assumed to have a hidden side effect, and thus I should not emit a load/store pair if the user hasn't written to the builtin.
This means that if one branch stores a value and another doesn't, I cannot simply store in both.

Need to think how to solve this.

Keenuts · 2024-11-28T16:17:04Z

Hello all!
So for the reasons outlined above (the builtin store has a hidden side effect), I changed the whole design to implement this as regular global variables, and removed the ctor/dtor/load/store builtins.

s-perron

From a SPIR-V perspective this looks good to me.

proposals/NNNN-spirv-input-builtin.md

Keenuts · 2024-12-02T10:27:29Z

Thanks for the review, feedback applied!

[SPIR-V] Add proposal for I/O builtin in Clang

aaaf73c

Signed-off-by: Nathan Gauër <[email protected]>

s-perron reviewed Nov 8, 2024

damyanp requested a review from tex3d November 8, 2024 18:56

revert spec change

4a66994

Keenuts requested a review from s-perron November 15, 2024 15:29

s-perron reviewed Nov 20, 2024
