Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference implementation for parallel_phase feature #1570

Open
wants to merge 35 commits into
base: master
Choose a base branch
from

Conversation

isaevil
Copy link
Contributor

@isaevil isaevil commented Nov 26, 2024

Description

Add a comprehensive description of proposed changes

Fixes # - issue number(s) if exists

Type of change

Choose one or multiple, leave empty if none of the other choices apply

Add a respective label(s) to PR if you have permissions

  • bug fix - change that fixes an issue
  • new feature - change that adds functionality
  • tests - change in tests
  • infrastructure - change in infrastructure and CI
  • documentation - documentation update

Tests

  • added - required for new features and some bug fixes
  • not needed

Documentation

  • updated in # - add PR number
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

List users with @ to send notifications

Other information

@isaevil isaevil changed the title Add reference implementation of parallel_block feature Add reference implementation for parallel_block feature Nov 26, 2024
@isaevil
Copy link
Contributor Author

isaevil commented Nov 27, 2024

@akukanov @aleksei-fedotov @vossmjp Could you please take a look at the PR in terms of implementation and ABI.

P.S. Tests are still WIP.

Copy link
Contributor

@aleksei-fedotov aleksei-fedotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete review, just a couple of starters to think about.

src/tbb/waiters.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
@isaevil isaevil changed the title Add reference implementation for parallel_block feature Add reference implementation for parallel_phase feature Dec 4, 2024
Copy link
Contributor

@aleksei-fedotov aleksei-fedotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of comments from me. Have not looked at the tests yet. Also, not finished reviewing the PHASE_* switching logic because of found issue.

include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Show resolved Hide resolved
src/tbb/arena.cpp Outdated Show resolved Hide resolved
src/tbb/arena.h Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/waiters.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
@isaevil isaevil marked this pull request as ready for review December 13, 2024 15:19
src/tbb/arena.cpp Outdated Show resolved Hide resolved
src/tbb/arena.cpp Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
test/tbb/test_task_arena.cpp Outdated Show resolved Hide resolved
test/tbb/test_task_arena.cpp Outdated Show resolved Hide resolved
test/tbb/test_task_arena.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, I think need to cover in tests the following:

  • The DELAYED_LEAVE is returned back once new task is submitted.
    • Here I think we can test that the return of a worker after one-time-fast-leave happens generally longer than the following returns made after new task(s) is/are submitted.
  • Nested parallel phases with combinations of DELAYED and (ONE TIME) FAST leaves.

src/tbb/waiters.h Outdated Show resolved Hide resolved
dnmokhov
dnmokhov previously approved these changes Dec 20, 2024
Copy link
Contributor

@dnmokhov dnmokhov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@akukanov akukanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments below are not critical; feel free to commit as-is and address later.

include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
src/tbb/arena.h Outdated Show resolved Hide resolved
@@ -95,6 +95,11 @@ TBB_EXPORT void __TBB_EXPORTED_FUNC isolate_within_arena(d1::delegate_base& d, s
TBB_EXPORT void __TBB_EXPORTED_FUNC enqueue(d1::task&, d1::task_arena_base*);
TBB_EXPORT void __TBB_EXPORTED_FUNC enqueue(d1::task&, d1::task_group_context&, d1::task_arena_base*);
TBB_EXPORT void __TBB_EXPORTED_FUNC submit(d1::task&, d1::task_group_context&, arena*, std::uintptr_t);

#if __TBB_PREVIEW_PARALLEL_PHASE
TBB_EXPORT void __TBB_EXPORTED_FUNC register_parallel_phase(d1::task_arena_base*, std::uintptr_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use "register" and "unregister" in the names for the entry points? To me "registration" does not imply the start and end of the active time in the region, but might be used to indicate that a region just exists. For example, an athlete might register for a race but that doesn't mean they've started running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main idea was to use more generic names, which do not depend on feature API. So, if API will change dramatically, entry point names would still make some sense.

Copy link
Contributor

@akukanov akukanov Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be alternative suitable generic verbs, such as "initiate", "enter", "commence", "launch", "open" for the beginning and "complete", "conclude", "close", "finish", "exit".

From the semantical point of view (i.e. speaking of a program execution phase), I would likely use "enter/exit" or "initiate/complete"; but since it's an internal entry point, I am OK with any choice made, including "register/unregister".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the suggestions, I like "enter/exit".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akukanov @vossmjp Renamed entry points to "enter/exit".

auto median_automatic = utils::median(times_automatic.begin(), times_automatic.end());
auto median_fast = utils::median(times_fast.begin(), times_fast.end());

WARN_MESSAGE(median_automatic < median_fast,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all of the tests guaranteed to work (i.e. not have warnings) for hybrid systems where automatic is not delayed leave?

Copy link
Contributor Author

@isaevil isaevil Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question. I suppose that only one test case(Parallel Phase retains workers in task_arena) would produce expected results on hybrid systems. Other tests don't make much sense on such configurations, but I think it is fine since these are warnings, not asserts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we skip the warning on hybrid systems? For example, don't output warning if there is more than 1 core type.

isaevil and others added 2 commits January 15, 2025 15:26
Co-authored-by: Alexey Kukanov <[email protected]>
Signed-off-by: Isaev, Ilya <[email protected]>
Signed-off-by: Isaev, Ilya <[email protected]>
@isaevil isaevil force-pushed the dev/pavelkumbrasev/parallel_block branch from c05d9a7 to 515fd33 Compare January 15, 2025 14:34
Signed-off-by: Isaev, Ilya <[email protected]>
Signed-off-by: Isaev, Ilya <[email protected]>
Signed-off-by: Isaev, Ilya <[email protected]>
@isaevil
Copy link
Contributor Author

isaevil commented Jan 15, 2025

@aleksei-fedotov @akukanov @vossmjp @pavelkumbrasev @dnmokhov I have updated the RFC document: included some technical details and conditions to leave the experimental stage.

Signed-off-by: Isaev, Ilya <[email protected]>
@@ -433,6 +528,9 @@ void arena::advertise_new_work() {
workers_delta = 1;
}

#if __TBB_PREVIEW_PARALLEL_PHASE
my_thread_leave.reset_if_needed();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following assert be useful?

Suggested change
my_thread_leave.reset_if_needed();
__TBB_ASSERT(mandatory_delta > 0 || workers_delta > 0, "Potentially delaying fast leave?");
my_thread_leave.reset_if_needed();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a valid assert. In case of workerless arena and spawn of the task both of those values will be zeroes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay... let it be then:

Suggested change
my_thread_leave.reset_if_needed();
__TBB_ASSERT((mandatory_delta > 0 || workers_delta > 0) ||
(is_arena_workerless() && work_type == work_spawned),
"Potentially delaying fast leave?");
my_thread_leave.reset_if_needed();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I do not readily see how this assert is related to the parallel phase.
And if you move the assertion next to the code in the lines 522-529, it seems it just duplicates almost exactly the conditions and assignments there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know that it duplicates. But the stress here is on the words "almost exactly". With the assert I sort of wanted to make sure we do not fall into recently discovered issue of dropping the fast leave setting when threads are recalled from arena. I don't mind not having an assert here, since, as you said, the conditions duplicate it. However, I do think we need to tell reader about the desired behavior here, it feels like the code itself might not be good at telling that.

Another interesting consideration is to reduce cross-NUMA accesses as much as possible. If thread can base its decision whether it needs to read and/or modify global-to-arena variables on the data it has already read, then it makes sense to avoid doing more re-reads of other global-to-arena data. Of course, if we are sure that the set of "almost exactly" is empty here, then the thread already avoids unnecessary touches to the data that increases cross-NUMA traffic.

Copy link
Contributor

@akukanov akukanov Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertions are good to enforce preconditions for a function, and in some cases of sophisticated internal logic also good to check expected invariants in certain point. Having an assertion that checks the conditions explicitly set by the code right above is a pure duplication. The part that is not an exact duplication could be checked in a proper place, of course; also reset_if_needed could internally assert on its expected preconditions.

For the NUMA part of the comment, I cannot make the connection to the assertion being discussed.

src/tbb/arena.h Outdated Show resolved Hide resolved
include/oneapi/tbb/task_arena.h Show resolved Hide resolved
Signed-off-by: Isaev, Ilya <[email protected]>
Copy link
Contributor

@aleksei-fedotov aleksei-fedotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Contributor

@akukanov akukanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that might also simplify the implementation logic.

Below is how. Note that separation of significant bits for different leave states was a necessary prerequisite to operate with ONE_TIME_FAST_LEAVE as a single bit flag.

src/tbb/arena.h Outdated
static const std::uintptr_t ONE_TIME_FAST_LEAVE = 1 << 1;
static const std::uintptr_t PARALLEL_PHASE = 1 << 2;

std::atomic<std::uintptr_t> my_state{static_cast<std::uintptr_t>(-1)};
Copy link
Contributor

@akukanov akukanov Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use UINTPTR_MAX instead of this cast of -1, here and in other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/tbb/arena.h Show resolved Hide resolved
src/tbb/arena.h Outdated
Comment on lines 204 to 211
void reset_if_needed() {
std::uintptr_t curr = ONE_TIME_FAST_LEAVE;
if (my_state.load(std::memory_order_relaxed) == curr) {
// Potentially can override decision of the parallel block from future epoch
// but it is not a problem because it does not violate the correctness
my_state.compare_exchange_strong(curr, DELAYED_LEAVE);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In reset, you only need to drop the one-time flag, correct? Since now it does not intersect with other states, it should be enough to just fetch_and(~ONE_TIME_FAST_LEAVE).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the desire is to drop the one-time flag only if a new work is submitted in the empty arena. So, the flag cannot be blindly dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, at first glance it seems like @akukanov suggestion shouldn't contradict the desire. Since unregister_parallel_phase sets this bit only on the last parallel phase completion, when new work submission happens the state will be either still ONE_TIME_FAST_LEAVE, or already include some amount of parallel phase references, which means that this bit will be already dropped by some of the first parallel phases.

Copy link
Contributor

@akukanov akukanov Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if the arena is not empty, then the flag should have been dropped already, and fetch_and would just do nothing. We can still leave the load and the check if it gives a proven performance advantage, but certainly compare_exchange_strong is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced CAS by fetch_and.

src/tbb/arena.h Outdated
Comment on lines 219 to 231
do {
// Need to add a reference for this start of a parallel phase, preserving the leave
// policy. Except for the case when one time fast leave was requested at the end of a
// previous phase.
desired = prev;
if (prev == ONE_TIME_FAST_LEAVE) {
// State was previously transitioned to "One-time Fast leave", thus with the start
// of new parallel phase, it should be transitioned to "Delayed leave"
desired = DELAYED_LEAVE;
}
__TBB_ASSERT(desired + PARALLEL_PHASE > desired, "Overflow detected");
desired += PARALLEL_PHASE; // Take into account this start of a parallel phase
} while (!my_state.compare_exchange_strong(prev, desired));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be simplified as fetch_add(PARALLEL_PHASE) followed by fetch_and(~ONE_TIME_FAST_LEAVE) in case the bit is set in the returned value of fetch_add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. Once we have added the reference to parallel phase it should be safe to drop the bit and, just like in case with reset_if_needed method function, trying to drop already dropped flag should do no harm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/tbb/arena.h Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants