Return buffer write barrier trade-off #111127
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Note that the caller can be either JITed code or "manually managed" code in the runtime, like debugger funceval. If we were to change the calling convention, all of these places would have to be updated as necessary.
PGO type tracking is per call site. You might be thinking of VSD, which I believe is per type.
How big is the problem - is there a real-world use case? Presumably the write barrier should bail out early on the "is not on heap" check.
Thanks, my misunderstanding.
Yes a real world postgres driver. Tested against TE fortunes, but this would apply to actual applications just as well, including backends that I own. I've marked two hot path property getters as no inline, these return message structs and total RPS decreased by 1% as a result. These are purely stack only barriers, so either this g_lowest_address, g_highest_address is constantly evicted from cache or something else is going on. Either way the cheap 'in heap' check isn't. I'm aware the barriers differ slightly between NativeAOT and CoreCLR but the perf differences seem comparable. The fortunes benchmark also includes http handling, sorting, html encoding and template rendering in this RPS total; isolating the change in a small message stream repro would obviously make it a much larger portion of the total than the 1% here. Conversely, as this is a manual change, having the JIT apply the change broadly should make the total effect much larger, as I have plenty of other methods, including those that can't be inlined today that include this fixed cost. The profiler attributes 60% of total time of the getter to the barrier, remaining cost looks to be the stack frame. (checked barrier instruction profile shown below) (instruction profile of an actual heap writing barrier) Related, @EgorBo ref struct assignments to managed references should get their write barriers elided by default. using System;
public class Program
{
object _obj = new();
Test M() => new Test() { Obj = _obj, Obj2 = _obj, Obj3= _obj };
void M1(ref Test ret) => ret = new Test() { Obj = _obj, Obj1 = _obj, Obj2= _obj };
void M2(out Test ret) => ret = new Test() { Obj = _obj, Obj1 = _obj, Obj2= _obj };
public ref struct Test {
public object Obj;
public object Obj1;
public object Obj2;
}
} I found #103503 but it seems to be limited to field writes on ref structs. |
I wonder if a first step could be some sort of analysis in the JIT where it counts call sites with a stack hidden buffer vs a heap one, along the lines of a PGO profile. And if there is, say, 90% usage of the stack buffer vs the heap buffer, then two versions of the method get created: one with a write barrier and one without?
The runtime does not support having two copies of the same method with different calling conventions today. It would be a big lift to build this support just for this. A first step can be a prototype that implements the change in the JIT and looks at the code diffs. If we like the new trade-off, implement the rest of the required changes in the runtime.
I am just curious - is it arm64-only, or does it reproduce on x64 as well?
Thanks! I'll look into these.
I would be quite curious about the diffs for NativeAOT/no tiering, as PGO does help to make up for some of the current costs.
Am I correct in my understanding that there is slightly more work done in the NativeAOT variants for these indirect loads as well?
I have not done a lot of testing on x64; I'll try to check the perf diff on my ryzen 4 pc later, it needs some setup.
Yes, it's a bit less efficient on NativeAOT (it doesn't patch the actual write-barrier code).
I've pushed an experiment where the JIT inlines the "is on GC heap" check into codegen - it allows us to skip the call overhead and do the stack copy more efficiently (because we precisely know its size): #111561. Further improvements can be made there to mitigate regressions in the case where the return buffer does point to the heap.
Methods returning struct types do so via the stack, writing the struct to a hidden return buffer reference passed by the caller. As this reference is opaque to the callee, the JIT must emit write barriers *if the struct contains references*, in case the buffer points to the heap. As long as the JIT manages to inline the callee - if it's not out of budget, blocked by virtual calls or EH, etc. - it can elide these barriers when it can see that the return buffer reference definitely points to the stack.
My theory is that the current tradeoff to treat hidden return buffer references as managed is the wrong one for modern .NET code, where returning structs (including those containing references) has become much more common. It's an essential tool for high performance and low allocation code.
Any sort of 'good practice' code structuring, separating concerns for clarity or introducing an abstraction boundary can easily introduce new stack frames. This adds write barrier costs for each additional frame, assuming some progressive handling of the same result is performed. Network protocols are a good example where it's natural to want high performance, no allocation per message, and progressive handling of concerns (framing, parsing, handling, out-of-band handling, etc.) spread across dependent methods, causing successive barrier costs without touching the heap. Profiling this on arm64 - where write barriers are already more expensive (tracked in #109652) - on messages that are simple to parse and commonly small in size, write barriers accounted for an unfortunate portion of the total time spent.
Nested opaque IEnumerables are another good example, where IEnumerator<T>.Current calls can cause successive barrier costs. As I understand it, PGO guarded devirtualization is not a huge help here: it does concrete type tracking not per call site but per type, across all virtual calls involving it. For megamorphic interfaces like IEnumerable it's extremely unlikely the globally most common type(s) - which the JIT could then lead down the optimized inlined path - will be selected more than once in a nested enumerator call stack.

I've gathered that return buffer references are only managed to allow caller code to have the callee write directly to a field stored on the heap. If this could no longer be optimized (i.e. it would require a stack-to-heap copy) but the trade-off would be that returns no longer introduce write barriers at all, my expectation is that this leads to an overall performance improvement.
If it turns out to be a wash, the trade-off might still be beneficial for predictability. It was surprising to me (and others /cc @neon-sunset #92662) that stack-only code involves write barriers at all. It's a non-obvious performance surprise in code that is structured and written the obvious way - much more so than an additional copy for a heap field assignment from a callee return value would surprise me, as such code visibly communicates that it operates across the heap and stack.