I just stumbled into this and it's not actually causing me a specific problem. Bug or not? Maybe.
- [x] Checked for duplicates

I tried to.
Describe the bug
Looping over array writes using `arr[i]` is about 270 times faster (edited) than `*(arr+i)`. I am not claiming this is definitely a real difference in write performance. That's complicated.
Expected behavior
I think there's room to quibble, for an interpreter especially, what "identical" means below, but I'd expect more similar results.
According to the standard (7.6.1.2 in what may be the latest version):

> "The expression E1[E2] is identical (by definition) to *((E1)+(E2)), except that in the case of an array operand, the result is an lvalue if that operand is an lvalue and an xvalue otherwise."
To Reproduce
Just run this in cling:
```cpp
#include <iostream>
#include <chrono>

void benchmark() {
    const int size = (int)(1E7);
    volatile int* arr = new volatile int[size]; // dynamic array
    const int multiplier = 10; // loop the loop this many times

    // Prefill, to page in the memory (maybe):
    for (int i = 0; i < size; i++) {
        arr[i] = size - i;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
            arr[i] = i;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::chrono::duration<double> duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Indexed: " << duration.count() << " seconds" << std::endl;

    volatile int* arr2 = new volatile int[size]; // second dynamic array
    start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
            // arr2[i] = i; // swap these lines for the control test
            *(arr2 + i) = i;
        }
    }
    end = std::chrono::high_resolution_clock::now();
    duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Pointer math " << duration.count() << " seconds" << std::endl;
}

int main() {
    benchmark();
}
```
I get some fluctuation but this is a typical output:
```text
Indexed: 0.104759 seconds
Pointer math 27.4305 seconds
```
Setup
```text
------------------------------------------------------------------
| Welcome to ROOT 6.30/04 https://root.cern |
| (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
| Built for linuxx8664gcc on Jul 26 2024, 08:45:57 |
| From heads/master@tags/v6-30-04 |
| With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 |
| Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q' |
------------------------------------------------------------------
```
Compiled myself. This is an Ubuntu VM running in VMware Workstation on Windows 10. I tried to keep system loads low and reproduced the results.
Additional context
For what it's worth, I also made a version using a memory-mapped file for allocation, to maybe prevent the loop writes from being optimized away, and the results were the same.
Actually understanding what's going on here is not so simple. The only way I know to be certain the writes aren't optimized away is to isolate them in separately compiled binary functions, so that the interpreter cannot know what the callee is doing. I wrote a version like that as well. To avoid too much function call overhead, I compiled the inner loops as separate compilation units (.so files) using cling, and interpreted the outer code and outer loop. That gave 0.4s for both. This is also similar to what I get with g++ with default optimization.
So it's tempting to say that, for the posted version, a) the interpreter is just really slow, and b) it optimizes away the writes in the indexed case. Even so, this is still unexpected behavior to me, in that it treats the two forms differently.
But this may not be the case. The posted 0.1-second result is not the same as removing the loops entirely: that gives about 1 ns. Also, the time does scale proportionally with the total number of writes, across both inner and outer loops. So the loops are surely not being optimized away; the actual writes might be. Finally, in the modular version (which I can provide), g++ -O3 gives ~0.04 s (4E-2 s) for each, and again that scales correctly with both loop lengths. This suggests that the 0.1 s in the original result can be real, and that the difference is down to optimization, but not to optimizing away the writes.
The total range here, real or not, is from 0.04 s to 27 s, a factor of 675.
For what it's worth, for fun I wrote an optimized assembly version of the inner loop using maximally unrolled, maximally gratuitous SSE, the core of the code (which might even be right) being:
And this gives about 0.045 s for the same 1E8 writes, and I'm 100% sure it's not being optimized away, obviously. Being in a VM, it's likely I/O limited. But I think this kind of sets the performance limit in this context.