
Can slicing a Memory<T> be made faster by the JIT by eliding the call to the write barrier #12414

Open
ahsonkhan opened this issue Apr 4, 2019 · 15 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), JitUntriaged (CLR JIT issues needing additional triage), tenet-performance (Performance related issue)
Milestone
Future

Comments

@ahsonkhan
Member

Slicing a Memory<T> does not modify the underlying object it points to; it only changes the index/length. However, slicing does create a new Memory<T>, and hence the object field has to be stored again.
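
For reference, Memory<T>'s shape is essentially the following (simplified; the real struct carries extra validation and encodes pinning in _index):

public readonly struct Memory<T>
{
    private readonly object _object; // unchanged by Slice, but re-stored, which is what triggers the barrier
    private readonly int _index;     // updated by Slice
    private readonly int _length;    // updated by Slice
}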

This makes Memory<T> slicing slower, since the JIT injects a call to the write barrier. It also means methods like the one below are no longer leaf methods, and hence a stack frame gets added as well.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void Advance(int count)
{
    Debug.Assert(count >= 0 && _buffered <= int.MaxValue - count);
    _buffered += count;

    // Unsafe, do not do this.
    //UnsafeMemory<byte> temp = Unsafe.As<Memory<byte>, UnsafeMemory<byte>>(ref _buffer).Slice(count);
    //_buffer = Unsafe.As<UnsafeMemory<byte>, Memory<byte>>(ref temp);

    _buffer = _buffer.Slice(count);
}

Can the JIT elide the call to the write barrier here? Are there concerns with the GC moving the object?

From @AndyAyersMS:

Though perhaps it doesn't matter as the pointer the jit has and the pointer the struct has should have the same value whether or not GC happens in between

Disassembly:
https://www.diffchecker.com/zbiwhYXM
Note the call here:
call coreclr!coreclr_shutdown_2+0xc910

I believe this is coreclr!JIT_WriteBarrier.

00007ff9`65e76820 System.Text.Json.Perf_MemSlice.MemSlice()
00007ff9`65e76820 57              push    rdi
00007ff9`65e76821 56              push    rsi
00007ff9`65e76822 53              push    rbx
00007ff9`65e76823 4883ec20        sub     rsp,20h
00007ff9`65e76827 488bf2          mov     rsi,rdx
00007ff9`65e7682a 3909            cmp     dword ptr [rcx],ecx
00007ff9`65e7682c 4883c110        add     rcx,10h
00007ff9`65e76830 8b790c          mov     edi,dword ptr [rcx+0Ch]
00007ff9`65e76833 83ff01          cmp     edi,1
00007ff9`65e76836 7262            jb      00007ff9`65e7689a
00007ff9`65e76838 488b11          mov     rdx,qword ptr [rcx]
00007ff9`65e7683b 8b5908          mov     ebx,dword ptr [rcx+8]
00007ff9`65e7683e ffc3            inc     ebx
00007ff9`65e76840 ffcf            dec     edi
00007ff9`65e76842 83ff01          cmp     edi,1
00007ff9`65e76845 725e            jb      00007ff9`65e768a5
00007ff9`65e76847 ffc3            inc     ebx
00007ff9`65e76849 ffcf            dec     edi
00007ff9`65e7684b 83ff01          cmp     edi,1
00007ff9`65e7684e 7260            jb      00007ff9`65e768b0
00007ff9`65e76850 ffc3            inc     ebx
00007ff9`65e76852 ffcf            dec     edi
00007ff9`65e76854 83ff01          cmp     edi,1
00007ff9`65e76857 7262            jb      00007ff9`65e768bb
00007ff9`65e76859 ffc3            inc     ebx
00007ff9`65e7685b ffcf            dec     edi
00007ff9`65e7685d 83ff01          cmp     edi,1
00007ff9`65e76860 7264            jb      00007ff9`65e768c6
00007ff9`65e76862 ffc3            inc     ebx
00007ff9`65e76864 ffcf            dec     edi
00007ff9`65e76866 83ff01          cmp     edi,1
00007ff9`65e76869 7266            jb      00007ff9`65e768d1
00007ff9`65e7686b ffc3            inc     ebx
00007ff9`65e7686d ffcf            dec     edi
00007ff9`65e7686f 83ff01          cmp     edi,1
00007ff9`65e76872 7268            jb      00007ff9`65e768dc
00007ff9`65e76874 ffc3            inc     ebx
00007ff9`65e76876 ffcf            dec     edi
00007ff9`65e76878 83ff01          cmp     edi,1
00007ff9`65e7687b 726a            jb      00007ff9`65e768e7
00007ff9`65e7687d ffc3            inc     ebx
00007ff9`65e7687f ffcf            dec     edi
00007ff9`65e76881 488bce          mov     rcx,rsi
00007ff9`65e76884 e887dd8e5f      call    coreclr!coreclr_shutdown_2+0xc910 (00007ff9`c5764610)
00007ff9`65e76889 895e08          mov     dword ptr [rsi+8],ebx
00007ff9`65e7688c 897e0c          mov     dword ptr [rsi+0Ch],edi
00007ff9`65e7688f 488bc6          mov     rax,rsi
00007ff9`65e76892 4883c420        add     rsp,20h
00007ff9`65e76896 5b              pop     rbx
00007ff9`65e76897 5e              pop     rsi
00007ff9`65e76898 5f              pop     rdi
00007ff9`65e76899 c3              ret
00007ff9`65e56af0 System.Text.Json.Perf_MemSlice.MemUnsafeSlice()
00007ff9`65e56af0 4883ec28        sub     rsp,28h
00007ff9`65e56af4 90              nop
00007ff9`65e56af5 3909            cmp     dword ptr [rcx],ecx
00007ff9`65e56af7 4883c110        add     rcx,10h
00007ff9`65e56afb 8b410c          mov     eax,dword ptr [rcx+0Ch]
00007ff9`65e56afe 83f801          cmp     eax,1
00007ff9`65e56b01 725a            jb      00007ff9`65e56b5d
00007ff9`65e56b03 4c8b01          mov     r8,qword ptr [rcx]
00007ff9`65e56b06 8b4908          mov     ecx,dword ptr [rcx+8]
00007ff9`65e56b09 ffc1            inc     ecx
00007ff9`65e56b0b ffc8            dec     eax
00007ff9`65e56b0d 83f801          cmp     eax,1
00007ff9`65e56b10 7251            jb      00007ff9`65e56b63
00007ff9`65e56b12 ffc1            inc     ecx
00007ff9`65e56b14 ffc8            dec     eax
00007ff9`65e56b16 83f801          cmp     eax,1
00007ff9`65e56b19 724e            jb      00007ff9`65e56b69
00007ff9`65e56b1b ffc1            inc     ecx
00007ff9`65e56b1d ffc8            dec     eax
00007ff9`65e56b1f 83f801          cmp     eax,1
00007ff9`65e56b22 724b            jb      00007ff9`65e56b6f
00007ff9`65e56b24 ffc1            inc     ecx
00007ff9`65e56b26 ffc8            dec     eax
00007ff9`65e56b28 83f801          cmp     eax,1
00007ff9`65e56b2b 7248            jb      00007ff9`65e56b75
00007ff9`65e56b2d ffc1            inc     ecx
00007ff9`65e56b2f ffc8            dec     eax
00007ff9`65e56b31 83f801          cmp     eax,1
00007ff9`65e56b34 7245            jb      00007ff9`65e56b7b
00007ff9`65e56b36 ffc1            inc     ecx
00007ff9`65e56b38 ffc8            dec     eax
00007ff9`65e56b3a 83f801          cmp     eax,1
00007ff9`65e56b3d 7242            jb      00007ff9`65e56b81
00007ff9`65e56b3f ffc1            inc     ecx
00007ff9`65e56b41 ffc8            dec     eax
00007ff9`65e56b43 83f801          cmp     eax,1
00007ff9`65e56b46 723f            jb      00007ff9`65e56b87
00007ff9`65e56b48 ffc1            inc     ecx
00007ff9`65e56b4a ffc8            dec     eax
00007ff9`65e56b4c 4c8902          mov     qword ptr [rdx],r8
00007ff9`65e56b4f 894a08          mov     dword ptr [rdx+8],ecx
00007ff9`65e56b52 89420c          mov     dword ptr [rdx+0Ch],eax
00007ff9`65e56b55 488bc2          mov     rax,rdx
00007ff9`65e56b58 4883c428        add     rsp,28h
00007ff9`65e56b5c c3              ret
00007ff9`65e46af0 System.Text.Json.Perf_MemSlice.SpanSlice()
00007ff9`65e46af0 4883ec28        sub     rsp,28h
00007ff9`65e46af4 90              nop
00007ff9`65e46af5 488b4108        mov     rax,qword ptr [rcx+8]
00007ff9`65e46af9 4885c0          test    rax,rax
00007ff9`65e46afc 7465            je      00007ff9`65e46b63
00007ff9`65e46afe 8b4808          mov     ecx,dword ptr [rax+8]
00007ff9`65e46b01 83f901          cmp     ecx,1
00007ff9`65e46b04 7263            jb      00007ff9`65e46b69
00007ff9`65e46b06 4883c010        add     rax,10h
00007ff9`65e46b0a ffc9            dec     ecx
00007ff9`65e46b0c 48ffc0          inc     rax
00007ff9`65e46b0f 83f901          cmp     ecx,1
00007ff9`65e46b12 725b            jb      00007ff9`65e46b6f
00007ff9`65e46b14 ffc9            dec     ecx
00007ff9`65e46b16 48ffc0          inc     rax
00007ff9`65e46b19 83f901          cmp     ecx,1
00007ff9`65e46b1c 7257            jb      00007ff9`65e46b75
00007ff9`65e46b1e ffc9            dec     ecx
00007ff9`65e46b20 48ffc0          inc     rax
00007ff9`65e46b23 83f901          cmp     ecx,1
00007ff9`65e46b26 7253            jb      00007ff9`65e46b7b
00007ff9`65e46b28 ffc9            dec     ecx
00007ff9`65e46b2a 48ffc0          inc     rax
00007ff9`65e46b2d 83f901          cmp     ecx,1
00007ff9`65e46b30 724f            jb      00007ff9`65e46b81
00007ff9`65e46b32 ffc9            dec     ecx
00007ff9`65e46b34 48ffc0          inc     rax
00007ff9`65e46b37 83f901          cmp     ecx,1
00007ff9`65e46b3a 724b            jb      00007ff9`65e46b87
00007ff9`65e46b3c ffc9            dec     ecx
00007ff9`65e46b3e 48ffc0          inc     rax
00007ff9`65e46b41 83f901          cmp     ecx,1
00007ff9`65e46b44 7247            jb      00007ff9`65e46b8d
00007ff9`65e46b46 ffc9            dec     ecx
00007ff9`65e46b48 48ffc0          inc     rax
00007ff9`65e46b4b 83f901          cmp     ecx,1
00007ff9`65e46b4e 7243            jb      00007ff9`65e46b93
00007ff9`65e46b50 ffc9            dec     ecx
00007ff9`65e46b52 48ffc0          inc     rax
00007ff9`65e46b55 488902          mov     qword ptr [rdx],rax
00007ff9`65e46b58 894a08          mov     dword ptr [rdx+8],ecx
00007ff9`65e46b5b 488bc2          mov     rax,rdx
00007ff9`65e46b5e 4883c428        add     rsp,28h
00007ff9`65e46b62 c3              ret

Benchmark:

    public class Perf_MemSlice
    {
        byte[] _array;
        Memory<byte> _memory;

        [GlobalSetup]
        public void Setup()
        {
            _array = new byte[128];
            _memory = _array;
        }

        [Benchmark]
        public Memory<byte> MemSlice()
        {
            Memory<byte> memory = _memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);

            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            return memory;
        }

        [Benchmark]
        public Memory<byte> MemUnsafeSlice()
        {
            UnsafeMemory<byte> memory = Unsafe.As<Memory<byte>, UnsafeMemory<byte>>(ref _memory).Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);

            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            memory = memory.Slice(1);
            return Unsafe.As<UnsafeMemory<byte>, Memory<byte>>(ref memory);
        }

        [Benchmark]
        public Span<byte> SpanSlice()
        {
            Span<byte> span = _array.AsSpan(1);
            span = span.Slice(1);
            span = span.Slice(1);
            span = span.Slice(1);

            span = span.Slice(1);
            span = span.Slice(1);
            span = span.Slice(1);
            span = span.Slice(1);
            return span;
        }
    }

    // Unsafe hack to measure what perf we can get if we don't have the write barrier.
    // Courtesy of Levi
    public struct UnsafeMemory<T>
    {
        private readonly IntPtr _object;
        private readonly int _index;
        private readonly int _length;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public UnsafeMemory<T> Slice(int start)
        {
            if ((uint)start > (uint)_length)
            {
                ThrowArgumentOutOfRangeException();
            }

            // It is expected for _index + start to be negative if the memory is already pre-pinned.
            return new UnsafeMemory<T>(_object, _index + start, _length - start);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal UnsafeMemory(IntPtr obj, int start, int length)
        {
            // No validation performed in release builds; caller must provide any necessary validation.

            _object = obj;
            _index = start;
            _length = length;
        }

        internal static void ThrowArgumentOutOfRangeException()
        {
            throw new ArgumentOutOfRangeException();
        }
    }

[image: benchmark results]

cc @AndyAyersMS, @CarolEidt, @davidfowl, @jkotas

category:cq
theme:barriers
skill-level:expert
cost:large

@stephentoub
Member

Benchmark

What results are you getting?

@ahsonkhan
Member Author

ahsonkhan commented Apr 4, 2019

What results are you getting?

Sorry, forgot to add that. Updated.

I can provide the perf numbers for slicing in Utf8JsonWriter.Advance as well (it becomes quite expensive since that method is called a lot: ~10% slower). As a workaround, I would not slice at all and would store the index instead.
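
A sketch of that workaround (field names hypothetical):

// Workaround sketch: leave the Memory<byte> field untouched and advance an
// int offset instead. Writing an int never needs a write barrier, so the
// method stays a leaf.
private Memory<byte> _buffer;
private int _offset;

private void Advance(int count)
{
    Debug.Assert(count >= 0 && _offset <= int.MaxValue - count);
    _offset += count;                  // plain int store: no barrier, no frame
    // consumers slice lazily, e.g. _buffer.Span.Slice(_offset)
}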

@AndyAyersMS
Member

In the benchmark, the write barrier at the end is for copying into the return value struct. I don't see how the jit can omit that, since the jit does not know where that struct is located. A more representative test that mimics Advance would write back to _memory instead of returning a value.

For the case of Advance -- I think I was too optimistic earlier, and a write barrier is also required here, but I want to think about it some more.

@GrabYourPitchforks
Member

In particular, this pattern is the problematic one:

someObj._memField = someObj._memField.Slice(5);

The Memory<T> struct consists of three fields: an object and two ints. Slicing returns a new Memory<T> struct with the same object and two different int values. Essentially, this is a glorified someObj._memField._backingObj = someObj._memField._backingObj;, but the JIT's control flow analysis doesn't seem to understand that a reference-type field is being assigned the value it already had, so it goes back through the write barrier.

Ideally the JIT's control flow analysis would be able to recognize and optimize this pattern so all callers would get the benefit automatically. We could also solve this by introducing an API like static Memory<T>.SliceInPlace(ref Memory<T> memory, int index) : void if we really need an immediate workaround, but I'm hesitant to do this unless we have strong evidence it's necessary.

@davidfowl
Member

This pattern is used in a bunch of places in ASP.NET Core and in the default Pipe implementation itself (basically, reference types that need to store Memory<byte> instead of Span<byte> because they are heapable).

@AndyAyersMS
Member

My concern is something like the following... maybe it is overblown?

Since the Memory<T> is a field m of some ref class instance r, it is visible cross-thread. And so it is possible that, between the time one thread reads the fields of m, computes a new struct value, and writes that back, some other thread has reassigned m. I'll agree that it is hard to imagine valid programs doing this sort of thing, but doing so should not lead to GC holes.

Imagine initially that r is in "old space" and m's object field is a reference to an object in "new space". This means that there should be a card table bit set for the memory region containing m. Thread 1 reads the fields of m and starts in on the slicing operation. On thread 2, m is reassigned and the object field of m is now changed to reference some object in "old space". A GC kicks in. During GC, the original object (now only referenced on the stack for thread 1) is not promoted. Since m no longer contains an old->new reference, the card table bit is cleared. GC finishes and thread 1 resumes, finishes slicing, and prepares to update m. It must now write back the reference with a write barrier to avoid losing track of the old->new reference that it is about to re-introduce.
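
In code form, the interleaving looks roughly like this (a sketch with hypothetical names):

// r lives in old space; r.m._object currently points into new space,
// so the card bit covering m is set.
class R { public Memory<byte> m; }

static void Thread1(R r, int count)
{
    Memory<byte> copy = r.m;   // thread 1 reads the fields of m
    // thread 2: r.m = someOldSpaceMemory;  m no longer holds an old->new ref
    // GC runs: copy's object survives only via this stack; the card bit
    //          covering m is cleared
    r.m = copy.Slice(count);   // writes back the original object reference,
                               // re-introducing an old->new ref: without a
                               // write barrier the card bit stays clear -> GC hole
}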

So to safely elide the barrier it seems to me that the jit needs to know that either (a) no multi-threaded access to m is possible; or (b) no GC has occurred between thread 1 reading the object field of m and writing it back.

@jkotas
Member

jkotas commented Apr 4, 2019

So to safely elide the barrier

I think you would need to elide the write completely: the JIT would need to figure out that the value about to be written is the same as what was read just recently and omit the write.
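
In other words, a conceptual sketch of the stores involved (_object/_index/_length being the private Memory<T> fields):

// What the write-back in Advance lowers to today, conceptually:
//   _buffer._object = obj;    // object-ref store: goes through JIT_WriteBarrier
//   _buffer._index  = index;  // plain int store
//   _buffer._length = length; // plain int store
// Eliding the write means the JIT proves obj is the value _buffer._object
// already holds and emits only the two int stores, dropping the object store
// (and with it the barrier) altogether.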

@AndyAyersMS
Member

I think you would need to elide the write completely

If there was no GC between read and write, the only thing another thread could do wrt GC state is set card table bits, and we allow these to be an overstatement of the true old->new ref set (e.g. we don't ever clear card table bits via codegen, or use a write barrier when writing null). So the card table bits are either compatible with the value that was read, or are a safe overstatement. And so, the write does not need a barrier.

Assuming the above thinking is correct, it would be useful in other cases too, like the array element swap idiom.

Proving there can be no other-thread interference is not going to be possible for the jit, in general. The only way to convey this currently is for the user code to copy the struct to a local, operate on the local copy, then propagate the local back to the original after some sequence of operations. The operations need not all be visible in the local's declared scope, if it is a ref struct. But that pattern is likely not compatible with the current chatty API; I would guess there are few opportunities to amortize cost over a sequence.
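
As a sketch of that pattern (assuming a caller can batch its slices):

// Slice on a local (stack stores need no barrier) and write back to the heap
// field once at the end: one barrier amortized over the whole sequence.
Memory<byte> local = _buffer;
local = local.Slice(count1);
local = local.Slice(count2);
local = local.Slice(count3);
_buffer = local;              // the only store that goes through the barrier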

Adding a method like SliceInPlace makes more sense to me, as it conveys the intent directly. But then you'd lose readonly, which probably makes it a nonstarter.

I don't see anything we can do here in the short term, so moving this to future.

@GrabYourPitchforks
Member

Adding SliceInPlace doesn't require removing readonly from the struct. For example:

public static void SliceInPlace(ref Memory<T> memory, int startIndex)
{
    // update memory._index and memory._length while leaving memory._object as-is
}
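
Call sites like Advance would then read (hypothetical):

Memory<byte>.SliceInPlace(ref _buffer, count); // _object untouched: two int stores, no barrier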

It would still be fully "safe" (barring threading issues, but we make no guarantees there anyway), so from a safety and reliability perspective I don't have an issue with such an API. It just strikes me as unclean. That's why I'd only recommend it as a backup if a JIT change isn't feasible.

@davidfowl
Member

Is the performance benefit worth the API addition?

@GrabYourPitchforks
Member

I don't think we have evidence for that yet. Our measurements so far are for microbenchmarks, but that doesn't mean it'll show up in end-to-end benchmarks.

@jkotas
Member

jkotas commented Apr 9, 2019

Memory<T> was designed as a relatively slow construct that should not be used on hot paths. If the end-to-end benchmarks show it as a bottleneck, it means the code is not using Memory<T> for what it was designed for.

@davidfowl
Member

Memory<T> was designed as a relatively slow construct that should not be used on hot paths. If the end-to-end benchmarks show it as a bottleneck, it means the code is not using Memory<T> for what it was designed for.

We say that, and at the same time we are changing the JsonWriter to a class, which means we need to store a Memory<T> (or T[]). Would you argue that it's better to use an array and an index rather than a Memory<T>?

@jkotas
Member

jkotas commented Apr 9, 2019

it's better to use an array and an index

I think so.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@NinoFloris
Contributor

Instead of eliding it, could there be a branch that checks the to-be-written reference for equality before calling the write barrier? Somewhat similar to the is-in-heap check, or to how the barrier does not write to the card table if the bit is already set?

Would doing a (volatile?) read even be a worthwhile cost to pay to reduce the cost of these kinds of assignments? Maybe the JIT is able to help here by emitting the equality-check barrier only when it can conclude the source location is the same as the destination location?

Doing this at all might be tricky considering the memory model we have, but it may be OK to introduce this read in the write barrier? Skipping the write on equality could cause existing data races and tearing to manifest in slightly different ways.
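
A sketch of that check (the real JIT_WriteBarrier is hand-tuned assembly in the runtime; this is only pseudocode for the shape of the idea):

static void CheckedWriteBarrier(ref object dst, object value)
{
    if (ReferenceEquals(dst, value))
        return;            // value unchanged: skip the store and the card update
    dst = value;           // the raw reference store
    // ...followed by the usual ephemeral-range check and card table mark
}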

So I'm not sure... It's an idea.
