diff --git a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org new file mode 100755 index 0000000000..da67be97cc --- /dev/null +++ b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org @@ -0,0 +1,123 @@ +# -*- fill-column: 80; -*- + +#+title: Link ~tbbbind~ with static HWLOC to improve predictability of NUMA support API + +*Note:* This document is a sub-RFC of the https://github.com/oneapi-src/oneTBB/pull/1535. +Specifically, the "Increased availability of NUMA support" section. + +* Introduction +oneTBB has a soft dependency on several variants of ~tbbbind~, which +the library loads during the initialization stage. Each ~tbbbind~, in turn, has +a hard dependency on a specific version of the HWLOC library [1, 2]. The soft +dependency means that the library continues the execution +even if the system loader fails to resolve the hard dependency on HWLOC for +~tbbbind~. In this case, oneTBB does not discover the hardware topology. +Instead, it defaults to viewing all CPU cores as uniform, consistent with TBB behavior when +NUMA constraints are not used. As a result, the following code returns the irrelevant values that +do not reflect the actual topology: + +#+begin_src C++ +std::vector numa_nodes = oneapi::tbb::info::numa_nodes(); +std::vector core_types = oneapi::tbb::info::core_types(); +#+end_src + +This lack of valid HW topology, caused by the absence of a third-party library, is +the major problem with the current oneTBB behavior. The problem lies in the lack of diagnostics +making it difficult for developers to detect. +As a result, the code continues to run but fails to use NUMA as intended. + +Dependency on a shared HWLOC library has the following benefits: +1. Code reuse with all of the positive consequences out of this, including + relying on the same code that has been tested and debugged, allowing the OS + to share it among different processes, which consequently improves on cache + locality and memory footprint. That's the primary purpose of shared + libraries. +2. A drop-in replacement. Users are able to use their own version of HWLOC + without recompilation of oneTBB. This specific version of HWLOC could include + a hotfix to support a particular and/or new hardware that a customer has, but + whose support is not yet upstreamed to HWLOC project. It is also possible + that such support won't be upstreamed at all if that hardware is not going to + be available for massive users. It could also be a development version of + HWLOC that someone wants to test on their systems first. Of course, they can + do it with the static version as well, but that's more cumbersome as it + requires recompilation of every dependent component. + +The only disadvantage from depending on HWLOC library dynamically is that the +developers that use oneTBB's NUMA support API need to make sure the library is +available and can be found by oneTBB. Depending on the distribution model of a +developer's code, this is achieved either by: +1. Asking the end user to have necessary version of a dependency pre-installed. +2. Bundling necessary HWLOC version together with other pieces of a product + release. + +However, the requirement to fulfill one of the above steps for the NUMA API to +start paying off may be considered as an incovenience and, what is more +important, it is not always obvious that one of these steps is needed. +Especially, due to silent behavior in case HWLOC library cannot be found in the +environment. + +The proposal is to reduce the effect of the disadvantage +of relying on a dynamic HWLOC library. +The improvements involve statically linking HWLOC with one of the ~tbbbind~ libraries distributed together +with oneTBB. At the same time, you retain the flexibility to specify different version of HWLOC library +if needed. + +Since HWLOC 1.x is an older version and modern operating +systems install HWLOC 2.x by default, the probability of users being +restricted to HWLOC 1.x is relatively small. Thus, +we can reuse the filename of the ~tbbbind~ library linked to HWLOC 1.x +for the library linked against a static HWLOC 2.x. + +* Proposal +1. Replace the dynamic link of ~tbbbind~ library currently linked + against HWLOC 1.x with a link to a static HWLOC library version 2.x. +2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the + dependency on functionality provided by the ~tbbbind~ layer. +3. Update the oneTBB documentation, including [[https://oneapi-src.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these pages]], to + detail the steps for identifying which ~tbbbind~ is being used. + +** Advantages +1. The proposed behavior introduces a fallback mechanism for resolving + the HWLOC library dependency when it is not in the environment, while still + preferring user-provided versions. As a result, the problematic oneTBB API usage + works as expected, returning an enumerated list + of actual NUMA nodes and core types on the system the code is running on, + provided that the loaded HWLOC library works on that system and that an + application properly distributes all binaries of oneTBB, sets the environment + so that the necessary variant of ~tbbbind~ library can be found and loaded. +2. Dropping support for HWLOC 1.x, does not introduce an additional + ~tbbbind~ variant while maintaining support for widely used + versions of HWLOC. + +** Disadvantages +By default, there is still no diagnostics if you fail to correctly setup an environment with your +version of HWLOC. Although, specifying the ~TBB_VERSION=1~ +environment variable helps identify configuration issues quickly. + +* Alternative Handling for Missing System Topology +The other behavior in case HWLOC library cannot be found is to be more explicit +about the problem of a missing component and to either issue a warning or to +refuse working requiring one of the ~tbbbind~ variant to be loaded (e.g., throw +an exception). + +Comparing these alternative approaches to the one proposed. +** Common Advantages +- Explicitly indicates that the functionality being used does not work, + instead of failing silently. +- Avoids the need to distribute an additional variant of ~tbbbind~ library. + +** Common Disadvantages +- Requires additional step from the user side to resolve the problem. In other + words, it does not provide complete solution to the problem. + +*** Disadvantages of Issuing a Warning +- The warning may be unnoticed, especially if standard streams are + closed. + +*** Disadvantages of Throwing an Exception +- May break existing code that does not expect an exception to be thrown. +- Requires introduction of an additional exception hierarchy. + +* References +1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]] +2. [[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]]