Can slicing a Memory<T> be made faster by the JIT by eliding the call to write barrier #12414
Comments
What results are you getting?
Sorry, forgot to add that. Updated. I can provide the perf numbers for slicing in Utf8JsonWriter.Advance as well (it becomes quite expensive since that method is called a lot — ~10% slower). As a workaround, I would not slice at all and store the index instead.
In the benchmark, the write barrier at the end is for copying into the return-value struct. I don't see how the JIT can omit that, since the JIT does not know where that struct is located. A more representative test would be one that mimics the field-assignment case discussed below.
In particular, this pattern is the problematic one:

```csharp
someObj._memField = someObj._memField.Slice(5);
```

Ideally the JIT's control-flow analysis would be able to recognize and optimize this pattern so all callers would get the benefit automatically. We could also solve this by introducing an API like the SliceInPlace method sketched below.
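To make the trade-off concrete, here is a minimal sketch (hypothetical `Writer` class and member names) contrasting the barrier-heavy slicing pattern with the "store the index instead" workaround mentioned earlier in the thread:

```csharp
using System;

class Writer
{
    private Memory<byte> _memField;
    private int _consumed;

    // Problematic: assigning the sliced Memory back into a heap-resident
    // field re-stores the internal object reference, so the JIT emits a
    // write barrier on every call even though the reference is unchanged.
    public void AdvanceBySlicing(int count) => _memField = _memField.Slice(count);

    // Workaround: leave _memField untouched and advance a plain int instead;
    // writing an int into the heap needs no write barrier.
    public void AdvanceByIndex(int count) => _consumed += count;

    // Slice lazily at the point of use.
    public Memory<byte> Remaining => _memField.Slice(_consumed);
}
```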
This pattern is used in a bunch of places in ASP.NET Core and in the default Pipe implementation itself (basically, reference types that need to store a Memory<byte> instead of a Span<byte> because they can live on the heap).
My concern is something like the following; maybe it is overblown. Since the field is visible to other threads, something can happen between the read and the write. Imagine that a GC occurs, or another thread writes to the field, in that window. So to safely elide the barrier, it seems to me that the JIT needs to know that either (a) there is no multi-threaded access to the field, or …
I think you would need to elide the write completely: the JIT would need to figure out that the value about to be written is the same as what was read just recently, and omit the write.
If there was no GC between the read and the write, the only thing another thread could do w.r.t. GC state is set card table bits, and we allow those to be an overstatement of the true old→new ref set (e.g. we don't ever clear card table bits via codegen, or use a write barrier when writing null). So the card table bits are either compatible with the value that was read, or are a safe overstatement, and the write does not need a barrier.

Assuming the above thinking is correct, it would be useful in other cases too, like the array-element-swap idiom. Proving there can be no other-thread interference is not going to be possible for the JIT, in general. The only way to convey this currently is for the user code to copy the struct to a local, operate on the local copy, then propagate the local back to the original after some sequence of operations. The operations need not all be visible in the local's declared scope if it is a ref struct. But that pattern is likely not compatible with the current chatty API; I would guess there are few opportunities to amortize cost over a sequence.

Adding a method like … I don't see anything we can do here in the short term, so moving this to future.
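The "copy to a local, operate, propagate back" pattern described above could look like the following sketch (hypothetical `Pipe` class; it amortizes one write barrier over a whole sequence of slices):

```csharp
using System;

class Pipe
{
    private Memory<byte> _memField;

    public void Consume(int[] chunkSizes)
    {
        // Copy the heap field into a stack local: slicing the local is a
        // plain struct copy with no write barrier.
        Memory<byte> local = _memField;
        foreach (int n in chunkSizes)
            local = local.Slice(n);
        // Propagate back once: a single write barrier for the whole loop.
        _memField = local;
    }
}
```

As the comment above notes, this only helps when the API gives you a sequence of operations to amortize over; a chatty API that advances one slice per call gets no benefit.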
Adding:

```csharp
public static void SliceInPlace(ref Memory<T> memory, int startIndex)
{
    // update memory._index and memory._length while leaving memory._object as-is
}
```

It would still be fully "safe" (barring threading issues, but we make no guarantees there anyway), so from a safety and reliability perspective I don't have an issue with such an API. It just strikes me as unclean. That's why I'd only recommend it as a backup if a JIT change isn't feasible.
Is the performance benefit worth the API addition?
I don't think we have evidence for that yet. Our measurements so far are for microbenchmarks, but that doesn't mean it'll show up in end-to-end benchmarks.
We say that, and at the same time we are changing the JsonWriter to a class, which means we need to store a Memory<T> (or T[]). Would you argue that it's better to use an array and an index rather than a Memory<T>?
I think so. |
Instead of eliding it, could there be a branch checking equality of the to-be-written reference before calling the write barrier? Somewhat similar to the is-in-heap check, or how it does not write to the card table if the bit is already set. Would doing a (volatile?) read even be a worthwhile cost to pay to reduce the cost of these kinds of assignments? Maybe the JIT could help here by emitting the equality-check barrier only when it can conclude the source location is the same as the destination location. Doing this at all might be tricky considering the memory model we have, but it may be OK to introduce this read in the write barrier. Not doing the write on equality could cause existing data races and tearing to manifest in slightly different ways. So I'm not sure... it's an idea.
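A minimal sketch of the idea above (a hypothetical helper, not a real runtime API): read the destination first and skip the store, and therefore its write barrier, when the reference is unchanged. Whether the extra read/compare pays off, and how it interacts with the memory model, is exactly the open question raised in the comment.

```csharp
using System;

static class BarrierHelpers
{
    // Hypothetical: avoid re-storing an identical object reference.
    // Skipping the store also skips the write barrier the JIT would
    // otherwise emit for a heap-resident reference field.
    public static void StoreIfChanged<T>(ref T location, T value) where T : class
    {
        if (!ReferenceEquals(location, value))
            location = value;
    }
}
```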
Slicing a Memory<T> does not modify the underlying object it points to; it only modifies the index/length. However, it does create a new Memory<T> and hence has to set the object field.
This makes slicing slower, since the JIT injects a call to the write barrier. It also means methods like the one below are no longer leaf methods, so a stack frame gets added as well.
Can the JIT elide the call to the write barrier here? Are there concerns with the GC moving the object?
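To see why the object field forces a barrier, here is an illustrative sketch of Memory<T>'s shape (field names mirror the corelib layout, but this is not the real implementation): Slice only changes the index/length, yet it returns a whole new struct, and assigning that struct into a field of a heap object must also store the object reference — which is the store that triggers the write barrier.

```csharp
// Simplified stand-in for Memory<T>, for illustration only.
readonly struct SketchMemory<T>
{
    private readonly object _object; // underlying array (or owner)
    private readonly int _index;
    private readonly int _length;

    public SketchMemory(object obj, int index, int length)
    {
        _object = obj; _index = index; _length = length;
    }

    // Logically only _index/_length change, but the returned struct
    // carries _object too; storing the result into a heap location
    // re-stores that reference and so needs a write barrier.
    public SketchMemory<T> Slice(int start)
        => new SketchMemory<T>(_object, _index + start, _length - start);
}
```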
From @AndyAyersMS:
Disassembly:
https://www.diffchecker.com/zbiwhYXM
Note the call here:
call coreclr!coreclr_shutdown_2+0xc910
I believe this is CoreCLR!JIT_WriteBarrier.
Benchmark:
cc @AndyAyersMS, @CarolEidt, @davidfowl, @jkotas
category:cq
theme:barriers
skill-level:expert
cost:large