Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Allow stack allocate objects in loops #106526

Closed
wants to merge 3 commits into from

Conversation

hez2010
Copy link
Contributor

@hez2010 hez2010 commented Aug 16, 2024

Unlike stackalloc, stack allocated objects have no difference than using struct directly, and objects that don't escape from a loop are likely to be promoted later. Allow stack allocate objects in loops to get rid of more boxes.
Inspired from https://devblogs.microsoft.com/java/improving-openjdk-scalar-replacement-part-1-3.

class Test
{
    public static int CompositeChecksum(string[] messages)
    {
        int checksum = 0;
        foreach (var msg in messages)
        {
            var m = new Message(msg);
            int cs = m.Checksum();
            checksum += cs;
        }
        return checksum;
    }
}

class Message(string content)
{
    public string Content = content;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int Checksum()
    {
        int chks = 0;
        for (int i = 0; i < Content.Length; i++)
        {
            chks += Content[i];
        }
        return chks;
    }
}

New codegen for CompositeChecksum:

; Assembly listing for method Program+Foo:CompositeChecksum(System.String[]):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: fgCalledCount is 46
; 1 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T08] (  3,  3.00)     ref  ->  rcx         class-hnd single-def <System.String[]>
;  V01 loc0         [V01,T05] (  4,  6.00)     int  ->  rax
;  V02 loc1         [V02,T09] (  3,  2.67)     ref  ->  rcx         class-hnd exact single-def <System.String[]>
;* V03 loc2         [V03,T12] (  0,  0   )     int  ->  zero-ref
;* V04 loc3         [V04    ] (  0,  0   )     int1084  ->
zero-ref
;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V06 tmp1         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "NewObj constructor temp" <Program+Message>
;* V07 tmp2         [V07    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "Inlining Arg" <System.String>
;  V08 tmp3         [V08,T02] (  4, 24   )     int  ->  r10         "Inline stloc first use temp"
;* V09 tmp4         [V09,T11] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V10 tmp5         [V10    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "stack allocated ref class temp" <Program+Message>
;* V11 tmp6         [V11    ] (  0,  0   )    long  ->  zero-ref    "V10.[000..008)"
;  V12 tmp7         [V12,T06] (  3,  5.67)     ref  ->   r8         "V10.[008..016)"
;  V13 cse0         [V13,T07] (  3,  5.67)     int  ->   r9         "CSE #01: aggressive"
;  V14 cse1         [V14,T10] (  3,  2.67)     int  ->  rdx         "CSE #02: aggressive"
;  V15 rat0         [V15,T03] (  4,  6.67)   byref  ->  rcx         "Strength reduced derived IV"
;  V16 rat1         [V16,T04] (  4,  6.67)     int  ->  rdx         "Trip count IV"
;  V17 rat2         [V17,T00] (  4, 31.67)   byref  ->   r8         "Strength reduced derived IV"
;  V18 rat3         [V18,T01] (  4, 31.67)     int  ->   r9         "Trip count IV"
;
; Lcl frame size = 0

G_M27921_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1.00 PerfScore 0.00
G_M27921_IG02:  ;; offset=0x0000
       xor      eax, eax
       mov      edx, dword ptr [rcx+0x08]
       test     edx, edx
       jle      SHORT G_M27921_IG08
                                                ;; size=9 bbWeight=1.00 PerfScore 3.50
G_M27921_IG03:  ;; offset=0x0009
       add      rcx, 16
                                                ;; size=4 bbWeight=0.67 PerfScore 0.17
G_M27921_IG04:  ;; offset=0x000D
       mov      r8, gword ptr [rcx]
       xor      r10d, r10d
       mov      r9d, dword ptr [r8+0x08]
       test     r9d, r9d
       jle      SHORT G_M27921_IG07
                                                ;; size=15 bbWeight=2 PerfScore 11.00
G_M27921_IG05:  ;; offset=0x001C
       add      r8, 12
       align    [0 bytes for IG06]
                                                ;; size=4 bbWeight=1.67 PerfScore 0.42
G_M27921_IG06:  ;; offset=0x0020
       movzx    r11, word  ptr [r8]
       add      r10d, r11d
       add      r8, 2
       dec      r9d
       jne      SHORT G_M27921_IG06
                                                ;; size=16 bbWeight=10 PerfScore 37.50
G_M27921_IG07:  ;; offset=0x0030
       add      eax, r10d
       add      rcx, 8
       dec      edx
       jne      SHORT G_M27921_IG04
                                                ;; size=11 bbWeight=2.00 PerfScore 3.50
G_M27921_IG08:  ;; offset=0x003B
       ret
                                                ;; size=1 bbWeight=1.00 PerfScore 1.00

; Total bytes of code 60, prolog size 0, PerfScore 57.08, instruction count 22, allocated bytes for code 60 (MethodHash=8fa792ee) for method Program+Foo:CompositeChecksum(System.String[]):int (Tier1)

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Aug 16, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member

It doesn't appear that this addresses the fundamental problems with why we do not allow this?

https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/object-stack-allocation.md#other-restrictions-on-stack-allocations

Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.

dotnet/coreclr#20251 (comment)

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 16, 2024

Thanks for the historical discussion link! Seems that we need more analysis to make sure the object doesn't escape from the iteration of the loop before enabling this.

Going to let the CI finish to see how much we can benefit from this (approximately).

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 16, 2024

@MihuBot

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 16, 2024

@jakobbotsch It does give us really nice diffs. For example, it managed to remove the box for array[i]!.Equals(value) in ObjectEqualityComparer:IndexOf, and the box for !object.ReferenceEquals(node, nodes[i]) in ExpressionVisitor:Visit<T>.
See https://gist.github.com/MihuBot/0f096701b12609d54ed62e7f192e4731
I think it's worth to investigate further.

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 27, 2024

@MihuBot

@hez2010
Copy link
Contributor Author

hez2010 commented Aug 27, 2024

Addresses the fundamental problems by checking locals in preds to make sure locals in a loop won't be stored outsides the loop, now tests are passing.
@jakobbotsch PTAL

Ideally we can build a def-use for this instead of checking the preds, but the current way should be enough (though a bit conservative).

Comment on lines +264 to +280
if (basicBlockHasBackwardJump)
{
lclsInPreds = BitVecOps::MakeEmpty(&m_bitVecTraits);
for (BasicBlock* const predBlock : block->PredBlocks())
{
for (Statement* const stmt : predBlock->Statements())
{
GenTree* stmtExpr = stmt->GetRootNode();
if (stmtExpr->OperIsLocalStore())
{
GenTreeLclVarCommon* lclVar = stmtExpr->AsLclVarCommon();
unsigned int lclNum = lclVar->GetLclNum();
BitVecOps::AddElemD(&m_bitVecTraits, lclsInPreds, lclNum);
}
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is checking the preds like this sufficient? What prevents the allocated object from being stored in some other block inside the loop that reaches the backedge?

I think doing this analysis is going to require building the proper loop structures and probably also some form of liveness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents the allocated object from being stored in some other block inside the loop that reaches the backedge?

My thought is that when we see a backward edge, we check the local where the object is being stored to see if the target is a local defined in its preds, otherwise mark it escaping. For blocks that have back edges to the current block, they will be checked later so that we don't miss them.
IMO this should be a conservative approximation without accurate loop and liveness.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Jakob here. I think the analysis is more complicated than this, and we need proper loop structure.

If you have a proper loop where the entry block dominates the body, then locals defined in the entry block can still be live on the back edges if they are first used before being defined.

Building loops and running liveness seems to be a necessary prerequisite, and unfortunately, liveness will be impaired at this stage of the jit, because we haven't yet figured out which locals are truly exposed, so we may end up also needing points-to analysis to help refine what can be aliased (and that likely should dovetail with or supplant the existing escape analysis), and proper loopness might also be a challenge given that we might end up creating a new loop after this phase because of a recursive tail call.

Assuming we can indeed soundly prove that an object's references cannot outlive a loop iteration (accounting for any possible liveness from EH), we'd also need to ensure that the object fields are zeroed (as they would be for a heap allocation) and/or that they are non-gc fields that are definitely written before they are read.

None of these problems are impossible, but it's a fair amount of work.

@JulieLeeMSFT
Copy link
Member

@BruceForstall, PTAL at this community PR.

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Sep 23, 2024
@JulieLeeMSFT JulieLeeMSFT requested review from jakobbotsch and removed request for BruceForstall October 7, 2024 16:12
@BruceForstall
Copy link
Member

@AndyAyersMS since you're thinking about stack allocation enhancements (#104936), can you take a look?

@jakobbotsch
Copy link
Member

I'll try to take a look at this one soon...

@AndyAyersMS
Copy link
Member

@hez2010 I don't believe we can take something like this yet, there are too many missing pieces.

@hez2010
Copy link
Contributor Author

hez2010 commented Nov 26, 2024

Feel free to close the PR as I don't have enough time to get into this yet.

@AndyAyersMS
Copy link
Member

Feel free to close the PR as I don't have enough time to get into this yet.

Ok, let's close this one then. I still think it's worth pursuing, someday, but we'll need to think carefully about how we can make it happen.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants