JIT: Allow stack allocate objects in loops #106526

hez2010 · 2024-08-16T06:40:25Z

Unlike stackalloc, stack allocated objects have no difference than using struct directly, and objects that don't escape from a loop are likely to be promoted later. Allow stack allocate objects in loops to get rid of more boxes.
Inspired from https://devblogs.microsoft.com/java/improving-openjdk-scalar-replacement-part-1-3.

class Test
{
    public static int CompositeChecksum(string[] messages)
    {
        int checksum = 0;
        foreach (var msg in messages)
        {
            var m = new Message(msg);
            int cs = m.Checksum();
            checksum += cs;
        }
        return checksum;
    }
}

class Message(string content)
{
    public string Content = content;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int Checksum()
    {
        int chks = 0;
        for (int i = 0; i < Content.Length; i++)
        {
            chks += Content[i];
        }
        return chks;
    }
}

New codegen for CompositeChecksum:

; Assembly listing for method Program+Foo:CompositeChecksum(System.String[]):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: fgCalledCount is 46
; 1 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T08] (  3,  3.00)     ref  ->  rcx         class-hnd single-def <System.String[]>
;  V01 loc0         [V01,T05] (  4,  6.00)     int  ->  rax
;  V02 loc1         [V02,T09] (  3,  2.67)     ref  ->  rcx         class-hnd exact single-def <System.String[]>
;* V03 loc2         [V03,T12] (  0,  0   )     int  ->  zero-ref
;* V04 loc3         [V04    ] (  0,  0   )     int1084  ->
zero-ref
;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <Program+Message>
;* V07 tmp2         [V07    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "Inlining Arg" <System.String>
;  V08 tmp3         [V08,T02] (  4, 24   )     int  ->  r10         "Inline stloc first use temp"
;* V09 tmp4         [V09,T11] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V10 tmp5         [V10    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "stack allocated ref class temp" <Program+Message>
;* V11 tmp6         [V11    ] (  0,  0   )    long  ->  zero-ref    "V10.[000..008)"
;  V12 tmp7         [V12,T06] (  3,  5.67)     ref  ->   r8         "V10.[008..016)"
;  V13 cse0         [V13,T07] (  3,  5.67)     int  ->   r9         "CSE #01: aggressive"
;  V14 cse1         [V14,T10] (  3,  2.67)     int  ->  rdx         "CSE #02: aggressive"
;  V15 rat0         [V15,T03] (  4,  6.67)   byref  ->  rcx         "Strength reduced derived IV"
;  V16 rat1         [V16,T04] (  4,  6.67)     int  ->  rdx         "Trip count IV"
;  V17 rat2         [V17,T00] (  4, 31.67)   byref  ->   r8         "Strength reduced derived IV"
;  V18 rat3         [V18,T01] (  4, 31.67)     int  ->   r9         "Trip count IV"
;
; Lcl frame size = 0

G_M27921_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1.00 PerfScore 0.00
G_M27921_IG02:  ;; offset=0x0000
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       test     edx, edx
       jle      SHORT G_M27921_IG08
                                                ;; size=9 bbWeight=1.00 PerfScore 3.50
G_M27921_IG03:  ;; offset=0x0009
       add      rcx, 16
                                                ;; size=4 bbWeight=0.67 PerfScore 0.17
G_M27921_IG04:  ;; offset=0x000D
       mov      r8, gword ptr [rcx]
       xor      r10d, r10d
       mov      r9d, dword ptr [r8+0x08]
       test     r9d, r9d
       jle      SHORT G_M27921_IG07
                                                ;; size=15 bbWeight=2 PerfScore 11.00
G_M27921_IG05:  ;; offset=0x001C
       add      r8, 12
       align    [0 bytes for IG06]
                                                ;; size=4 bbWeight=1.67 PerfScore 0.42
G_M27921_IG06:  ;; offset=0x0020
       movzx    r11, word  ptr [r8]
       add      r10d, r11d
       add      r8, 2
       dec      r9d
       jne      SHORT G_M27921_IG06
                                                ;; size=16 bbWeight=10 PerfScore 37.50
G_M27921_IG07:  ;; offset=0x0030
       add      eax, r10d
       add      rcx, 8
       dec      edx
       jne      SHORT G_M27921_IG04
                                                ;; size=11 bbWeight=2.00 PerfScore 3.50
G_M27921_IG08:  ;; offset=0x003B
       ret
                                                ;; size=1 bbWeight=1.00 PerfScore 1.00

; Total bytes of code 60, prolog size 0, PerfScore 57.08, instruction count 22, allocated bytes for code 60 (MethodHash=8fa792ee) for method Program+Foo:CompositeChecksum(System.String[]):int (Tier1)

dotnet-policy-service · 2024-08-16T06:40:53Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jakobbotsch · 2024-08-16T08:28:53Z

It doesn't appear that this addresses the fundamental problems with why we do not allow this?

https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/object-stack-allocation.md#other-restrictions-on-stack-allocations

Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.

dotnet/coreclr#20251 (comment)

hez2010 · 2024-08-16T08:33:18Z

Thanks for the historical discussion link! Seems that we need more analysis to make sure the object doesn't escape from the iteration of the loop before enabling this.

Going to let the CI finish to see how much we can benefit from this (approximately).

hez2010 · 2024-08-16T08:34:51Z

@MihuBot

hez2010 · 2024-08-16T09:21:09Z

@jakobbotsch It does give us really nice diffs. For example, it managed to remove the box for array[i]!.Equals(value) in ObjectEqualityComparer:IndexOf, and the box for !object.ReferenceEquals(node, nodes[i]) in ExpressionVisitor:Visit<T>.
See https://gist.github.com/MihuBot/0f096701b12609d54ed62e7f192e4731
I think it's worth to investigate further.

hez2010 · 2024-08-27T05:03:51Z

@MihuBot

hez2010 · 2024-08-27T07:08:40Z

Addresses the fundamental problems by checking locals in preds to make sure locals in a loop won't be stored outsides the loop, now tests are passing.
@jakobbotsch PTAL

Ideally we can build a def-use for this instead of checking the preds, but the current way should be enough (though a bit conservative).

jakobbotsch · 2024-08-27T08:22:45Z

src/coreclr/jit/objectalloc.cpp

+        if (basicBlockHasBackwardJump)
+        {
+            lclsInPreds = BitVecOps::MakeEmpty(&m_bitVecTraits);
+            for (BasicBlock* const predBlock : block->PredBlocks())
+            {
+                for (Statement* const stmt : predBlock->Statements())
+                {
+                    GenTree* stmtExpr = stmt->GetRootNode();
+                    if (stmtExpr->OperIsLocalStore())
+                    {
+                        GenTreeLclVarCommon* lclVar = stmtExpr->AsLclVarCommon();
+                        unsigned int         lclNum = lclVar->GetLclNum();
+                        BitVecOps::AddElemD(&m_bitVecTraits, lclsInPreds, lclNum);
+                    }
+                }
+            }
+        }


Why is checking the preds like this sufficient? What prevents the allocated object from being stored in some other block inside the loop that reaches the backedge?

I think doing this analysis is going to require building the proper loop structures and probably also some form of liveness.

What prevents the allocated object from being stored in some other block inside the loop that reaches the backedge?

My thought is that when we see a backward edge, we check the local where the object is being stored to see if the target is a local defined in its preds, otherwise mark it escaping. For blocks that have back edges to the current block, they will be checked later so that we don't miss them.
IMO this should be a conservative approximation without accurate loop and liveness.

Agree with Jakob here. I think the analysis is more complicated than this, and we need proper loop structure.

If you have a proper loop where the entry block dominates the body, then locals defined in the entry block can still be live on the back edges if they are first used before being defined.

Building loops and running liveness seems to be a necessary prerequisite, and unfortunately, liveness will be impaired at this stage of the jit, because we haven't yet figured out which locals are truly exposed, so we may end up also needing points-to analysis to help refine what can be aliased (and that likely should dovetail with or supplant the existing escape analysis), and proper loopness might also be a challenge given that we might end up creating a new loop after this phase because of a recursive tail call.

Assuming we can indeed soundly prove that an object's references cannot outlive a loop iteration (accounting for any possible liveness from EH), we'd also need to ensure that the object fields are zeroed (as they would be for a heap allocation) and/or that they are non-gc fields that are definitely written before they are read.

None of these problems are impossible, but it's a fair amount of work.

JulieLeeMSFT · 2024-09-23T17:22:52Z

@BruceForstall, PTAL at this community PR.

BruceForstall · 2024-10-21T18:43:29Z

@AndyAyersMS since you're thinking about stack allocation enhancements (#104936), can you take a look?

jakobbotsch · 2024-10-25T09:45:59Z

I'll try to take a look at this one soon...

AndyAyersMS · 2024-11-26T01:03:34Z

@hez2010 I don't believe we can take something like this yet, there are too many missing pieces.

hez2010 · 2024-11-26T01:16:04Z

Feel free to close the PR as I don't have enough time to get into this yet.

AndyAyersMS · 2024-11-26T15:58:58Z

Feel free to close the PR as I don't have enough time to get into this yet.

Ok, let's close this one then. I still think it's worth pursuing, someday, but we'll need to think carefully about how we can make it happen.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 16, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Aug 16, 2024

MihuBot mentioned this pull request Aug 16, 2024

[JitDiff X64] [hez2010] JIT: Allow stack allocate objects in loops MihuBot/runtime-utils#560

Open

This was referenced Aug 16, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

AndyAyersMS mentioned this pull request Aug 16, 2024

Stack Allocation Enhancements #104936

Open

21 tasks

hez2010 added 3 commits August 27, 2024 11:51

Allow stack allocate objects in loop

442aefd

Format

a2825ac

Check preds

463f36e

hez2010 force-pushed the stackalloc-loop branch from 63e6984 to 463f36e Compare August 27, 2024 04:54

MihuBot mentioned this pull request Aug 27, 2024

[JitDiff X64] [hez2010] JIT: Allow stack allocate objects in loops MihuBot/runtime-utils#570

Open

jakobbotsch reviewed Aug 27, 2024

View reviewed changes

build-analysis bot mentioned this pull request Aug 27, 2024

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

JulieLeeMSFT requested a review from BruceForstall September 23, 2024 17:23

JulieLeeMSFT assigned hez2010 Sep 23, 2024

JulieLeeMSFT added this to the 10.0.0 milestone Sep 23, 2024

JulieLeeMSFT requested review from jakobbotsch and removed request for BruceForstall October 7, 2024 16:12

BruceForstall requested a review from AndyAyersMS October 21, 2024 18:43

AndyAyersMS closed this Nov 26, 2024

github-actions bot locked and limited conversation to collaborators Dec 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Allow stack allocate objects in loops #106526

JIT: Allow stack allocate objects in loops #106526

hez2010 commented Aug 16, 2024 •

edited

Loading

dotnet-policy-service bot commented Aug 16, 2024

jakobbotsch commented Aug 16, 2024

hez2010 commented Aug 16, 2024 •

edited

Loading

hez2010 commented Aug 16, 2024

hez2010 commented Aug 16, 2024 •

edited

Loading

hez2010 commented Aug 27, 2024

hez2010 commented Aug 27, 2024 •

edited

Loading

jakobbotsch Aug 27, 2024

hez2010 Aug 27, 2024

AndyAyersMS Nov 1, 2024

JulieLeeMSFT commented Sep 23, 2024

BruceForstall commented Oct 21, 2024

jakobbotsch commented Oct 25, 2024

AndyAyersMS commented Nov 26, 2024

hez2010 commented Nov 26, 2024

AndyAyersMS commented Nov 26, 2024

JIT: Allow stack allocate objects in loops #106526

JIT: Allow stack allocate objects in loops #106526

Conversation

hez2010 commented Aug 16, 2024 • edited Loading

dotnet-policy-service bot commented Aug 16, 2024

jakobbotsch commented Aug 16, 2024

hez2010 commented Aug 16, 2024 • edited Loading

hez2010 commented Aug 16, 2024

hez2010 commented Aug 16, 2024 • edited Loading

hez2010 commented Aug 27, 2024

hez2010 commented Aug 27, 2024 • edited Loading

jakobbotsch Aug 27, 2024

Choose a reason for hiding this comment

hez2010 Aug 27, 2024

Choose a reason for hiding this comment

AndyAyersMS Nov 1, 2024

Choose a reason for hiding this comment

JulieLeeMSFT commented Sep 23, 2024

BruceForstall commented Oct 21, 2024

jakobbotsch commented Oct 25, 2024

AndyAyersMS commented Nov 26, 2024

hez2010 commented Nov 26, 2024

AndyAyersMS commented Nov 26, 2024

hez2010 commented Aug 16, 2024 •

edited

Loading

hez2010 commented Aug 16, 2024 •

edited

Loading

hez2010 commented Aug 16, 2024 •

edited

Loading

hez2010 commented Aug 27, 2024 •

edited

Loading