This repository has been archived by the owner on May 23, 2024. It is now read-only.

Identify features we want that are not in ISO #2

Open
2 of 4 tasks
ibaned opened this issue Sep 9, 2019 · 14 comments

@ibaned
Contributor

ibaned commented Sep 9, 2019

  • single-expression conditional operator (if_then_else in stk_simd, choose in prototype)
  • strided loads & stores (more generally, scatter-gather)
  • cmath functions (the ISO paper doesn't mention these I think...)
  • single-expression fused multiply-add (examine whether this is actually faster than separate add and multiply, may not be needed if optimizer is good enough)
@nmhamster

@ibaned / @alanhumphrey - in almost every compiler I've used with intrinsics, a multiply intrinsic followed by an add intrinsic is converted to an FMA (if the compiler has FMA enabled). I would like us to avoid using lots of fancy (but unnecessarily complex) C++ to achieve what a minimal peephole optimizer can do.
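
A minimal scalar sketch of the two options being weighed here; the function names are hypothetical and not taken from any library in this thread. The first form leaves fusion to the compiler, which may contract it into one FMA instruction when FMA is enabled for the target and floating-point contraction is allowed. The second asks for a fused operation explicitly through `std::fma`.

```cpp
#include <cmath>

// Separate multiply and add. With FMA enabled for the target and
// floating-point contraction allowed, a compiler may emit one FMA here.
inline double mul_then_add(double a, double b, double c) {
  return a * b + c;
}

// Explicit fused multiply-add: one rounding step by definition.
inline double fused_mul_add(double a, double b, double c) {
  return std::fma(a, b, c);
}
```

Comparing the generated assembly of the two functions is one way to check whether a given compiler and flag set already performs the fusion.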

@nmhamster

Also, I wasn't sure why we still hadn't evaluated the vector_size attribute in GNU and Clang (https://clang.llvm.org/docs/LanguageExtensions.html). It seems we can build generic vector interfaces that apply across platforms without the need for intrinsic functions. I realize this is a C API and needs a C++ wrapper, but the compiler should produce pretty well optimized code for this kind of attribute, perhaps better than intrinsics in some cases if prefetches and other optimization phases fire (where intrinsics block them).
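
A minimal sketch of what the vector_size attribute looks like in use, assuming a GCC or Clang compiler. The type name `float4` and the helper functions are hypothetical; only the attribute itself and the element-wise operators come from the compiler extension.

```cpp
// GNU/Clang extension: a 16-byte vector holding four floats.
typedef float float4 __attribute__((vector_size(16)));

// Element-wise a * x + y written with ordinary operators, no intrinsics.
inline float4 axpy(float4 a, float4 x, float4 y) {
  return a * x + y;
}

// Horizontal sum using per-lane subscripting.
inline float sum_lanes(float4 v) {
  return v[0] + v[1] + v[2] + v[3];
}
```

A C++ wrapper class would hold one such vector as a member and forward its operators to these built-in ones.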

@ibaned
Contributor Author

ibaned commented Sep 9, 2019

@nmhamster I personally was unaware of that. The way the ISO interface works, we have template specializations called "ABI"s (a bit of a misnomer). Some of those "ABI"s directly call intrinsics, but I also implemented one where the data type was just float[4] and the loops have pragma omp simd in front of them. That actually gave a decent (but not as good) speedup. I suppose we can do the same thing for this approach: create an "ABI" specialization that implements things this way, and compare its performance. I do worry a little bit about the gaps in support in the table on that page, especially the lack of support for boolean operators.
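
A rough sketch of the array-plus-pragma style described above. The type `pack4` is a hypothetical stand-in and not the actual "ABI" specialization from the prototype; the pragma only takes effect when OpenMP SIMD support is switched on (for example with -fopenmp-simd on GCC and Clang) and is otherwise ignored.

```cpp
// Hypothetical pack type: plain array storage, no intrinsics.
struct pack4 {
  float v[4];
};

inline pack4 operator+(pack4 const& a, pack4 const& b) {
  pack4 r;
  // Tell the compiler the four lanes are independent of each other.
#pragma omp simd
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}

inline pack4 operator*(pack4 const& a, pack4 const& b) {
  pack4 r;
#pragma omp simd
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
  return r;
}
```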

@nmhamster

@ibaned - right, I was thinking the same thing. I am interested to see what performance we get. The GCC vector attributes do support boolean operators as expr ? true-value : false-value (a little like your vector choose function). Producing masking behavior without true mask support in hardware is quite hard because in the end you usually have to rely on AND operations to get something similar (which I'm sure you know having written things for SSE in the library).
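
The AND-based masking mentioned above can be sketched with the same vector extension, assuming GCC or Clang. The names `blend` and `min4` are hypothetical. A comparison of two float vectors yields an integer vector whose lanes are all-ones where the comparison holds and zero elsewhere, and that result is combined with AND, OR and NOT instead of a hardware mask register.

```cpp
typedef float float4 __attribute__((vector_size(16)));
typedef int int4 __attribute__((vector_size(16)));

// Lane-wise select: take tv where the mask lane is all-ones, fv elsewhere.
inline float4 blend(int4 mask, float4 tv, float4 fv) {
  // Casts between vector types of the same size reinterpret the bits.
  int4 t = (int4)tv;
  int4 f = (int4)fv;
  return (float4)((mask & t) | (~mask & f));
}

// Lane-wise minimum built from a comparison plus the blend above.
inline float4 min4(float4 a, float4 b) {
  int4 mask = a < b;  // each lane is -1 (all bits set) if true, 0 if false
  return blend(mask, a, b);
}
```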

@DavidPoliakoff

I would really recommend having a test suite, looping in James Elliott, and tracking performance across compilers. At LLNL we were toying with these kinds of libraries as I left, and it felt like every month we'd find out that such-and-such a compiler suddenly wasn't optimizing such-and-such a mechanism well anymore. If we have this work and a guide saying which compilers do better with which ABIs, we're in a good place.

@nmhamster

Rather than relying on profiling alone, I think we first need to take a good first-hand look at the code being generated and at some of the compiler output. A human in the loop during development is essential to understanding why the compiler behaves as it does. Once that has settled down a little more, I think the transition to profile-based ongoing assessment will be useful. I am particularly interested in whether we actually execute the vectorized code even when we do generate it, since Intel in particular makes some interesting runtime choices that sometimes prevent this. In short, we should do some homework here as a preliminary step.

@alanphumphrey

@DavidPoliakoff - Agreed on the test suite, etc. We talked at some length about this Friday. Also agree with @nmhamster on having a human in the loop initially, doing our homework, e.g., seeing what code is generated and whether we actually execute that vectorized code.

I will transition fully to this effort early next week (0.60 FTE is my SNL contract), and can stay on it for the necessary duration.

Thanks @ibaned for getting this conversation started.

@ibaned
Contributor Author

ibaned commented Sep 9, 2019

Since this issue was originally about missing pieces to the ISO interface, I'm going to answer the question that @alanphumphrey asked in the other issue because it fits better here.

My thinking is that we should try to propose changes to the ISO interface, especially where we see that it cannot be as fast as hand-coding without those changes or it is super inconvenient without them. So far I think there are three changes we can think about individually:

  1. make many <cmath> functions like sin and cos work with simd types
  2. provide a conditional operator of some kind
  3. provide a scatter/gather interface, assuming there are special intrinsics for this and it is not just loading scalars (the stk_simd strided load is just loading scalars)
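
Items 1 and 3 can be illustrated with a scalar fallback. This is only a sketch: `pack4`, `pack_sin` and `strided_load` are hypothetical names, and the strided load here simply loads scalars one lane at a time, the case the list above contrasts with dedicated gather intrinsics.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical pack type with plain array storage.
struct pack4 {
  double v[4];
};

// Item 1: a <cmath> function applied lane by lane.
inline pack4 pack_sin(pack4 const& x) {
  pack4 r;
  for (int i = 0; i < 4; ++i) r.v[i] = std::sin(x.v[i]);
  return r;
}

// Item 3: strided load that just loads scalars, one per lane.
inline pack4 strided_load(double const* p, std::ptrdiff_t stride) {
  pack4 r;
  for (int i = 0; i < 4; ++i) r.v[i] = p[i * stride];
  return r;
}
```

An intrinsic-backed specialization would keep the same signatures and replace the loop bodies, which is what would make the two implementations directly comparable.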

@ibaned
Contributor Author

ibaned commented Jan 3, 2020

@nmhamster I have some early data on different high-level approaches using a full and very non-trivial Sandia application built using Clang on Mac:

  1. using float[8] with pragma clang loop vectorize(enable): 20 seconds
  2. using float __attribute__((vector_size(32))): 15 seconds
  3. using direct AVX intrinsics: 10 seconds

It seems like calling vendor-specific intrinsics can still be way better in many cases.

@mhoemmen

mhoemmen commented Jan 5, 2020

@ibaned wrote:

2. provide a conditional operator of some kind

Matthias Kretz had a proposal to permit overloading the ternary operator.

@ibaned
Contributor Author

ibaned commented Jan 6, 2020

@mhoemmen awesome!
We just chose a random name for it and use it as a function (choose(cond,tv,fv)). It should be as easy as moving its implementation once the ternary operator can be overloaded.
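
A minimal sketch of what a function-style conditional with that signature might look like, using a scalar fallback. The `pack4` and `mask4` types are hypothetical stand-ins, and this is not the prototype's actual implementation.

```cpp
// Hypothetical value and mask types with plain array storage.
struct pack4 {
  double v[4];
};
struct mask4 {
  bool v[4];
};

// Lane-wise comparison producing a mask.
inline mask4 operator<(pack4 const& a, pack4 const& b) {
  mask4 m;
  for (int i = 0; i < 4; ++i) m.v[i] = a.v[i] < b.v[i];
  return m;
}

// Lane-wise equivalent of cond ? tv : fv.
inline pack4 choose(mask4 const& cond, pack4 const& tv, pack4 const& fv) {
  pack4 r;
  for (int i = 0; i < 4; ++i) r.v[i] = cond.v[i] ? tv.v[i] : fv.v[i];
  return r;
}
```

With this shape, an absolute value reads as `choose(a < zero, negated_a, a)`, a single expression with no per-lane branching in user code.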

@mhoemmen

mhoemmen commented Jan 6, 2020

@ibaned It's still just a proposal :-) Not sure how long it will take to get through.

@ibaned
Contributor Author

ibaned commented Jan 6, 2020

Yep won't hold my breath :)

@ibaned
Contributor Author

ibaned commented Mar 27, 2020

In the spirit of recording things that we might want to create ISO C++ papers about, @alanw0 and the STK team identified that an equivalent of std::copysign is useful, and also this multiplysign:

template <class T>
T multiplysign(T a, T b) {
  // Multiply a by the sign of b (+1 or -1).
  return a * copysign(T(1.0), b);
}
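
For scalar types the idea can be tried directly with `std::copysign`, as in the sketch below. A simd version would additionally need a `copysign` overload and a broadcast constructor for the pack type, which are assumptions here. The helper is named `multiplysign_scalar` only to mark it as an illustration.

```cpp
#include <cmath>

// Scalar illustration: multiply a by +1 or -1 depending on the sign of b.
template <class T>
T multiplysign_scalar(T a, T b) {
  return a * std::copysign(T(1.0), b);
}
```

Because `std::copysign` reads the sign bit, a negative zero in `b` also flips the sign of the result.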
