Return buffer write barrier trade-off #111127
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Note that the caller can be either JITed code or "manually managed" code in the runtime, like debugger funceval. If we were to change the calling convention, all of these places would have to be updated as necessary.
PGO type tracking is per call site. You might be thinking of VSD, which I believe is per type.
How big is the problem - is there a real-world use case? Presumably the write barrier should bail out early on the "is not on heap" check.
Thanks, my misunderstanding.
Yes a real world postgres driver. Tested against TE fortunes, but this would apply to actual applications just as well, including backends that I own. I've marked two hot path property getters as no inline, these return message structs and total RPS decreased by 1% as a result. These are purely stack only barriers, so either this g_lowest_address, g_highest_address is constantly evicted from cache or something else is going on. Either way the cheap 'in heap' check isn't. I'm aware the barriers differ slightly between NativeAOT and CoreCLR but the perf differences seem comparable. The fortunes benchmark also includes http handling, sorting, html encoding and template rendering in this RPS total; isolating the change in a small message stream repro would obviously make it a much larger portion of the total than the 1% here. Conversely, as this is a manual change, having the JIT apply the change broadly should make the total effect much larger, as I have plenty of other methods, including those that can't be inlined today that include this fixed cost. The profiler attributes 60% of total time of the getter to the barrier, remaining cost looks to be the stack frame. (checked barrier instruction profile shown below) (instruction profile of an actual heap writing barrier) Related, @EgorBo ref struct assignments to managed references should get their write barriers elided by default. using System;
public class Program
{
object _obj = new();
Test M() => new Test() { Obj = _obj, Obj2 = _obj, Obj3= _obj };
void M1(ref Test ret) => ret = new Test() { Obj = _obj, Obj1 = _obj, Obj2= _obj };
void M2(out Test ret) => ret = new Test() { Obj = _obj, Obj1 = _obj, Obj2= _obj };
public ref struct Test {
public object Obj;
public object Obj1;
public object Obj2;
}
} I found #103503 but it seems to be limited to field writes on ref structs. |
I wonder if a first step could be some sort of analysis in the JIT where it counts call sites with a stack hidden buffer vs a heap one, along the lines of a PGO profile. And if there is, say, 90% usage of the stack buffer vs the heap buffer, then two versions of the method get created: one with a write barrier and one without?
The runtime does not support having two copies of the same method with different calling conventions today. It would be a big lift to build this support just for this. A first step can be a prototype that implements the change in the JIT and looks at the code diffs. If we like the new trade-off, implement the rest of the required changes in the runtime.
I am just curious - is it arm64-only, or does it reproduce on x64 as well?
Thanks! I'll look into these.
I would be quite curious about the diffs for NativeAOT/no tiering, as PGO does help to make up for some of the current costs.
Am I correct in my understanding that there is slightly more work done in the NativeAOT variants for these indirect loads as well?
I have not done a lot of testing on x64; I'll try to check the perf diff on my ryzen 4 pc later, it needs some setup.
Yes, it's a bit less efficient on NativeAOT (it doesn't patch the actual write-barrier code).
I've pushed an experiment where the JIT inlines the "is on GC heap" check into codegen - it allows us to skip the call overhead and do the stack copy more efficiently (because we precisely know its size): #111561. Further improvements can be made there to mitigate regressions in the case where the return buffer does point to the heap.
Methods returning struct types do so via the stack, writing the struct to a hidden return buffer reference passed by the caller. As this reference is opaque to the callee, the JIT must emit write barriers *if the struct contains references*, in case the buffer points to the heap. As long as the JIT manages to inline the callee - if it's not out of budget, blocked by virtual calls or EH, etc. - it can elide these barriers when it can see that the return buffer reference definitely points to the stack.
My theory is that the current tradeoff to treat hidden return buffer references as managed is the wrong one for modern .NET code, where returning structs (including those containing references) has become much more common. It's an essential tool for high performance and low allocation code.
Any sort of 'good practice' code structuring, separating concerns for clarity or introducing an abstraction boundary can easily introduce new stack frames. This adds write barrier costs for each additional frame, assuming some progressive handling of the same result is performed. Network protocols are a good example where it's natural to want high performance, no allocation per message, and progressive handling of concerns (framing, parsing, handling, out-of-band handling, etc.) spread across dependent methods, causing successive barrier costs without touching the heap. Profiling this on arm64 - where write barriers are already more expensive (tracked in #109652) - on messages that are simple to parse and commonly small in size, write barriers accounted for an unfortunate portion of the total time spent.
Nested opaque IEnumerables are another good example, where IEnumerator<T>.Current calls can cause successive barrier costs. As I understand it, PGO guarded devirtualization is not a huge help here: it does concrete type tracking not per call site but per type, across all virtual calls involving it. For megamorphic interfaces like IEnumerable it's extremely unlikely the globally most common type(s) - which the JIT could then lead down the optimized inlined path - will be selected more than once in a nested enumerator call stack.

I've gathered that return buffer references are only managed to allow caller code to have the callee write directly to a field stored on the heap. If this could no longer be optimized (i.e. it would require a stack-to-heap copy) but the trade-off would be that returns no longer introduce write barriers at all, my expectation is that this leads to an overall performance improvement.
If it turns out to be a wash, the trade-off might still be beneficial for predictability. It was surprising to me (and others /cc @neon-sunset #92662) that stack-only code involves write barriers at all. It's a non-obvious performance surprise in code that is structured and written the obvious way - much more so than an additional copy for a heap field assignment from a callee return value would surprise me, as such code visibly communicates that it operates across the heap and stack.