
~270 times faster writing (or maybe not writing?) with arr[i] than with *(arr+i)? #540

Open
DougLeonard opened this issue Jan 4, 2025 · 1 comment

DougLeonard commented Jan 4, 2025

I just stumbled into this and it's not actually causing me a specific problem. Bug or not? Maybe.

  • [x] Checked for duplicates
    I tried to.

Describe the bug

Looping over array writes using arr[i] is about 270 times faster than with *(arr+i). I am not claiming this is definitely a real difference in write performance; that's complicated.

Expected behavior

There's room to quibble, especially for an interpreter, about what "identical" means below, but I'd expect far more similar results.

According to the standard (7.6.1.2 [expr.sub] in a recent draft):

"The expression E1[E2] is identical (by definition) to *((E1)+(E2)), except that in the case of an array operand, the result is an lvalue if that operand is an lvalue and an xvalue otherwise."

To Reproduce

Just run this in cling:

#include <iostream>
#include <chrono>

void benchmark() {
    const int size = (int)(1E7);
    volatile int* arr = new volatile int[size];  // dynamic array; volatile to discourage optimizing the writes away
    const int multiplier = 10;  // loop the loop this many times

    // Prefill, to page the memory in (maybe):
    for (int i = 0; i < size; i++) {
        arr[i] = size - i;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
            arr[i] = i;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << "Indexed:      " << duration.count() << " seconds" << std::endl;

    volatile int* arr2 = new volatile int[size];  // second dynamic array (note: not prefilled like arr)

    start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
//          arr2[i] = i;  // swap these lines for control test.
            *(arr2 + i) = i;
        }
    }
    end = std::chrono::high_resolution_clock::now();
    duration = end - start;
    std::cout << "Pointer math: " << duration.count() << " seconds" << std::endl;

    delete[] arr;
    delete[] arr2;
}

int main() {
    benchmark();
}

I get some fluctuation but this is a typical output:

Indexed:      0.104759 seconds
Pointer math: 27.4305 seconds

Setup

   ------------------------------------------------------------------
  | Welcome to ROOT 6.30/04                        https://root.cern |
  | (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Jul 26 2024, 08:45:57                 |
  | From heads/master@tags/v6-30-04                                  |
  | With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                   |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

Compiled myself. This is an Ubuntu VM running in VMware Workstation on Windows 10. I tried to keep system load low, and the results reproduce.

Additional context

For what it's worth, I also made a version using a memory-mapped file for allocation, to maybe prevent the loop writes from being optimized away, and the results were the same.
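
A minimal sketch of that kind of allocation (this uses an anonymous POSIX mapping rather than an actual file, for brevity; my real version differs in detail):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const long size = 10000000;                    // 1E7, as above
    const std::size_t bytes = size * sizeof(int);
    // Ask the OS for a page-backed buffer instead of using new[]:
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    volatile int* arr = static_cast<volatile int*>(p);
    for (long i = 0; i < size; ++i)
        arr[i] = (int)i;                           // same write loop as in the benchmark
    munmap(p, bytes);
}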

Actually understanding what's going on here is not so simple. The only way I know to be certain the writes aren't optimized away is to isolate them in separately compiled binary functions, so that the interpreter cannot know what the callee is doing. I wrote a version like that as well. To avoid too much function call overhead, I compiled the inner loops as separate compilation units (.so files) using cling, and interpreted the outer code and outer loop. That gave 0.4 s for both forms. This is also similar to what I get with g++ at default optimization.
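
Roughly, the modular version looks like this (function and file names are illustrative, not my exact code); the point is that the callee is opaque to the interpreter:

// fill.cpp -- build separately, e.g.: g++ -shared -fPIC fill.cpp -o fill.so
extern "C" void fill_indexed(volatile int* arr, int n) {
    for (int i = 0; i < n; ++i)
        arr[i] = i;          // arr[i] form
}

extern "C" void fill_pointer(volatile int* arr, int n) {
    for (int i = 0; i < n; ++i)
        *(arr + i) = i;      // *(arr+i) form
}

The interpreted side then loads the library (e.g. .L fill.so in cling) and times calls to these functions inside the outer loop.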

So it's tempting to say that, for the posted version, a) the interpreter is just really slow, and b) it optimizes away the writes in the indexed case. Even so, this is still unexpected behavior to me in that it treats the two forms differently.

But this may not be the case. The posted 0.1 s result is not the same as removing the loops entirely; that gives about 1 ns. Also, the time scales proportionally with the total number of writes, across both the inner and outer loops, so the loops themselves are surely not being optimized away. The actual writes might be. But finally, in the modular version (which I can provide), g++ -O3 gives about 0.04 s (4E-2 s) for each form, and again that scales correctly with both loop lengths. This tempts the conclusion that the 0.1 s in the original result could be real, and that the difference is down to optimization, but not to optimizing away the writes.
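
Back-of-the-envelope on those numbers: multiplier × size = 10 × 1E7 = 1E8 writes, so 0.105 s works out to about 1 ns per write, while 27.4 s works out to about 274 ns per write.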

The total range here, real or not, is from 0.04 s to 27 s, a factor of about 675.

@DougLeonard DougLeonard added the bug label Jan 4, 2025

DougLeonard commented Jan 6, 2025

For what it's worth, for fun I wrote an optimized assembly version of the inner loop using maximally unrolled, maximally gratuitous SSE. The core of the code (which might even be right) is:

.PLOOP:
    ; store 8 x 16 bytes (32 ints) per iteration
    movdqu  [rdi + rsi +   0], xmm8
    movdqu  [rdi + rsi +  16], xmm9
    movdqu  [rdi + rsi +  32], xmm10
    movdqu  [rdi + rsi +  48], xmm11
    movdqu  [rdi + rsi +  64], xmm12
    movdqu  [rdi + rsi +  80], xmm13
    movdqu  [rdi + rsi +  96], xmm14
    movdqu  [rdi + rsi + 112], xmm15
    ; advance the packed counter values for the next iteration
    paddd   xmm8,  xmm0
    paddd   xmm9,  xmm1
    paddd   xmm10, xmm2
    paddd   xmm11, xmm3
    paddd   xmm12, xmm4
    paddd   xmm13, xmm5
    paddd   xmm14, xmm6
    paddd   xmm15, xmm7
    add     rsi, 128          ; byte offset += 128
    cmp     rdx, rsi
    jne     .PLOOP            ; inner loop until rsi reaches rdx
    cmp     ecx, eax
    je      .END              ; outer loop bookkeeping

And this gives about 0.045 s for the same 1E8 writes, and I'm 100% sure it's not being optimized away, obviously. Being in a VM, it's likely I/O limited. But I think this sets the performance floor in this context.
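
For scale: 1E8 four-byte stores in 0.045 s is roughly 4E8 / 0.045 ≈ 8.9 GB/s of write bandwidth, which seems plausible as a memory-bandwidth ceiling for this setup.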
