-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathLibomptargetInterface.tex
executable file
·377 lines (283 loc) · 26.8 KB
/
LibomptargetInterface.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
\section{The libomptarget.so synchronous interface}\label{sc:LibomptargetSync}
The offload library implements the OpenMP 4.5 user-level runtime library routines:
\begin{itemize}
\item \code{int omp_get_num_devices(void)}
\item \code{int omp_get_initial_device(void)}
\item \code{int omp_is_initial_device(void)}
\end{itemize}
The offload library implements the following compiler-level runtime library routines:
\begin{itemize}
\item \code{void __tgt_register_lib(__tgt_bin_desc* desc)}\\
Register the \libomptarget{} library and initialize target state (i.e. global variables and target entry points) for the current host shared library/executable and the corresponding target execution images that have those entry points implemented. This does not trigger any execution in any target as any real work with the target device can be postponed until the first target region is encountered during execution. This function is expected to appear only once per host shared library/executable in the .init section and is called before any constructors or static initializers are called for the host. The name of the caller of \code{__tgt_register_lib} follows the same pattern of the C++ initializers in Clang and is set to \code{_GLOBAL__A_000000_OPENMPTGT}.
\item \code{void __tgt_target_data_begin(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype)}\\
Initiate a device data environment. It maps variables from the host data environment to the device data environment by recording the mapping between the references of variables used in the host and target into the \libomptarget{} internal structures. The associated variables in the target device data environment are initialized according to the map-type. In the event a given associated variable has already been mapped in other enclosing device data environment, no action is taken for that variable.
\item \code{void __tgt_target_data_end(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype)}\\
Close a device data environment. It removes mapped variables from the current device data environment, releases target memory and destroy the mappings created by \code{__tgt_target_data_begin()} that initiated the current device data environment. It assigns host variables the value of the corresponding device data environment variable according to the map-type.
\item \code{void __tgt_target_data_update(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype)}\\
Make the value of a set of variables consistent between the host device and a target device. If the variable's map type is \mfrom{}, use the value of the variable on the target device. If a variable’s map type is \mto{}, use the value of the variable on the host device.
\item \code{int32_t __tgt_target(int32_t device_id, void *host_addr, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype)}\\
Perform the same actions as \code{__tgt_target_data_begin} in case \code{arg_num} is non-zero and launch the execution of the target region on the target device; if \code{arg_num} is non-zero after the region execution is done it also performs the same action as \code{__tgt_target_data_end} above. If offloading fails, an error code is returned, which notifies the caller that the associated target region has to be executed by the host. The return code can be used as an error code which will give the compiler and run-time the freedom to implement optimized behaviors.
\item \code{int32_t __tgt_target_teams(int32_t device_id, void *host_addr, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t num_teams, int32_t thread_limit)}\\
This is an extension of \code{__tgt_target} where the caller is able to specify the (maximum) number of teams and threads in each team that should be used by \libomptarget{} when it launches the execution of the region. It reflects the common nesting of \dtarget{} and \dteams{} directives in OpenMP 4.5.
\end{itemize}
All the \code{__tgt_target...} calls presented above perform an initial check to understand if the target specified by \code{device_id} was already initialized, and if not, triggers that initialization. The information registered by \code{__tgt_register_lib} is used to accomplish that.
\subsection{Arguments for the libomptarget.so calls}
The following describes the arguments used by the \libomptarget{} interface and how they are used.
\noindent\rule{\textwidth}{0.4pt}
\code{__tgt_bin_desc* desc} points to a constant data struct statically defined by the compiler:
\begin{lstlisting}
struct __tgt_bin_desc{
uint32_t NumDevices;
__tgt_device_image *DeviceImages;
__tgt_offload_entry *EntriesBegin;
__tgt_offload_entry *EntriesEnd;
};
\end{lstlisting}
\code{NumDevices} is the number of device types whose execution image was generated by the compiler in order to implement some of the offloading entry points. The device types are specified by the user during the invocation of the compiler by passing the group option \command{–target-offload=Ti} where \command{Ti} is the triple of a target the user wants to support. If $n$ group options \command{–target-offload=Ti} associated with different device triples are specified when the compiler is invoked, \code{NumDevices} will $n$.
\code{DeviceImages} is a pointer to an array of \code{NumDevices} elements, whose element type is:
\begin{lstlisting}
struct __tgt_device_image{
void *ImageStart;
void *ImageEnd;
__tgt_offload_entry *EntriesBegin;
__tgt_offload_entry *EntriesEnd;
};
\end{lstlisting}
%
where \code{ImageStart} and \code{ImageEnd} contain the addresses where a target image associated to the current host executable/shared library for a given device type starts and ends, respectively. \code{ImageEnd} is non-inclusive, i.e. it points to the byte immediately after the target image ends.
\code{EntriesBegin} and \code{EntriesEnd} point to the first and last element of an array that contains the information of each global variable and target entry point that require a map between host and target. These pointers are present in both \code{__tgt_bin_desc} and \code{__tgt_device_image} so that the target dependent runtime (see Section 5) can use this information as well to more easily retrieve the entries from the target image (e.g. to retrieve the symbol names of the entries – see below).
Each element of the array pointed by \code{EntriesBegin} has type:
\begin{lstlisting}
struct __tgt_offload_entry{
void *addr;
char *name;
int64_t size;
};
\end{lstlisting}
%
where \code{addr} is the address of that global variable or entry point in the host, \code{name} is the name of the
symbol that refers to that global variable or entry point, and \code{size} is the size in bytes of the global variable or zero if it is an entry point. If the address field is set to \code{NULL}, the corresponding entry in the table is not supported by the target device associated to this table.
\noindent\rule{\textwidth}{0.4pt}
\libomptarget{} has to be able to map the host entries to the corresponding device entries.
There are different strategies that can be used for mapping the entries. The simplest one is to use the names of the entries for the mapping by performing a name-based search using the name field of the \code{__tgt_offload_entry}. The field name is useful for targets whose runtime requires access to the symbol names in order to locate the correspondent address in the target image. The array that starts at \code{EntriesBegin} is built by the compiler in conjunction with the linker, which forwards sequences of entries of different compilation units to the same binary section. In the event target and host toolchains can provide strict ordering for both target and host tables – then the mapping can be done by the sequence number of the entry. The frontend ensures that the global variables/entries follow the same order for both host and device. If the toolchain of a given target does not preserve the order, that target runtime may consult the host entries and obtain the same order of the symbols based on the name.
\noindent\rule{\textwidth}{0.4pt}
Static initializers and global destructors associated with device entries are implemented as target regions and executed in the same order they are required in the program. The callers of the constructors and destructors are always launched with a single thread and team. Example~\ref{ex:EntriesOrdering} shows a code snippet that would require the creation of these target regions.
\begin{example}
\lstset{basicstyle=\scriptsize,frame=single}
\begin{lstlisting}
#pragma omp declare target
class C{
C() {// ctor of C}
~C() {//dtor of C}
};
C a;
#pragma omp end declare target
foo(){
#pragma omp target
{ //target region 1 }
}
#pragma omp declare target
C b;
#pragma omp end declare target
bar(){
#pragma omp target
{ //target region 2 }
}
\end{lstlisting}
\lstset{basicstyle=\small\bfseries,frame=none}
\caption{Example requiring the creation of target regions to implement the device variables global initializers and destructors.}
\label{ex:EntriesOrdering}
\end{example}
%
\noindent\rule{\textwidth}{0.4pt}
\code{int32_t device_id} is an integer that uniquely identifies a given target. In the first call of \code{__tgt_register_lib}, \libomptarget{} detects the available target dependent RTLs in the system and uses them to query the number of devices of each type that are ready to be used. If $A$ devices of type \command{T1} and $B$ devices of type \command{T2} are found, where \command{T1} and \command{T2} are the types specified in that exact order by the user with \command{-–target-offload=Ti}, \code{device_id} $[0,A[$ will map to devices of type \command{T1} and \code{device_id} $[A,A+B[$ will map to devices of type \command{T2}. If using separate compilation, the user must use the same order of triples in the separate compiler runs, otherwise he cannot assume a unique map between device IDs and types. If \code{device_id} is greater or equal than $A+B$, the call where it is used will fail (region would be executed by the host) and no further action will be taken by \libomptarget{}. On top of the positive values used for \code{device_id}, the compiler also employs three reserved values:
\begin{itemize}
\item \code{device_id = -1}: informs the runtime that the user has not specified any device ID, and therefore the default must be used, which may be specified through an environment variables as specified in the OpenMP 4.5 specification.
\item \code{device_id = -2}: informs the runtime that the target action must be performed on all available devices and can be delayed until the first time the device action is invoked with a device ID greater or equal to \code{-1}. This is mainly used to call C++ global initializers, which only need to be called if the device is eventually used for executing at least one target region.
\item \code{device_id = -3}: informs the runtime that the target action must be performed on all available devices that were ever used in the current library. This is mainly used to call C++ destructors, which are only required if that device was used before.
\end{itemize}
\noindent\rule{\textwidth}{0.4pt}
\code{int32_t num_args} is the number of data pointers that require a mapping.
\noindent\rule{\textwidth}{0.4pt}
\code{void** args} is a pointer to an array with \code{num_args} arguments whose elements point to the first byte of the array section that needs to be mapped.
\noindent\rule{\textwidth}{0.4pt}
\code{int64_t* args_size} is a pointer to an array with \code{num_args} arguments whose elements contain the size in bytes of the array section to be mapped.
\noindent\rule{\textwidth}{0.4pt}
\code{void** args_base} is a pointer to an array with \code{num_args} arguments whose elements point to the base address of the declaration the mapping refers to. \code{args_base} differs from args if an array section does not start at zero. \libomptarget{} needs to know the base addresses in order to relate mapped data with target region arguments that are dereferenced in the target region body.
\noindent\rule{\textwidth}{0.4pt}
\code{int32_t *args_maptype} is a pointer to an array with \code{num_args} arguments whose elements contain the required map attributes as specified in the enum:
\begin{lstlisting}
enum tgt_map_type {
OMP_TGT_MAPTYPE_TO = 0x0001,
OMP_TGT_MAPTYPE_FROM = 0x0002,
OMP_TGT_MAPTYPE_ALWAYS = 0x0004,
OMP_TGT_MAPTYPE_DELETE = 0x0008,
OMP_TGT_MAPTYPE_MAP_PTR = 0x0010,
OMP_TGT_MAPTYPE_FIRST_REF = 0x0020,
OMP_TGT_MAPTYPE_RETURN_PTR = 0x0040,
OMP_TGT_MAPTYPE_PRIVATE_PTR = 0x0080,
OMP_TGT_MAPTYPE_PRIVATE_VAL = 0x0100
};
\end{lstlisting}
\sloppy
\code{OMP_TGT_MAPTYPE_TO} instructs the runtime to copy host data to the device in a \code{__tgt_target_data_begin}, \code{__tgt_target_update}, or before the kernel execution is triggered byt \code{__tgt_target}. \code{OMP_TGT_MAPTYPE_FROM} instructs the runtime to copy device data to the host in a \code{__tgt_target_data_end}, \code{__tgt_target_update}, or when finalizing \code{__tgt_target}. \code{OMP_TGT_MAPTYPE_ALWAYS} forces the copying regardless of the reference count associated with the map. \code{OMP_TGT_MAPTYPE_DELETE} forces the unmapping of the object in a target data end or when completing a target construct. \code{OMP_TGT_MAPTYPE_MAP_PTR} forces the runtime to map the pointer variable as well as the pointee variable. This attribute also instructs the runtime to initialize the value of the pointer on the device with the base device address of the pointee mapped variable. The attributes \code{OMP_TGT_MAPTYPE_TO}, \code{OMP_TGT_MAPTYPE_FROM}, \code{OMP_TGT_MAPTYPE_ALWAYS}, and \code{OMP_TGT_MAPTYPE_DELETE} only apply to the pointee variable.
\code{OMP_TGT_MAPTYPE_FIRST_REF} instructs the runtime that it is the first occurrence of this mapped variable within this construct, or the associated map information is the first one that relates with that variable (the map of a variable can result in multiple sets of map information to be passed to the runtime). When used with the \code{OMP_TGT_MAPTYPE_MAP_PTR}, it instructs the runtime that it is the first occurrence of the pointer mapped variable; a pointee mapped variable is considered by definition as its first occurrence. These attributes are used to determine when to update the reference count associated with a mapped variable, and when to include the base device address of a mapped variable in the device instance of a pointer variable.
\code{OMP_TGT_MAPTYPE_RETURN_PTR} instructs the runtime to return the base device address of the mapped variable in the corresponding \code{base_args} entry. When used with the \code{OMP_TGT_MAPTYPE_MAP_PTR} attribute, the runtime returns the base device address of the pointer mapped variable. This attribute is used to implement the \code{use_device_ptr} clause.
The next two flags indicate to the runtime that it is not dealing with mapped variables. First, \code{OMP_TGT_MAPTYPE_PRIVATE_PTR} informs the runtime that the described variable is a private variable. First-private variables are indicated by using the \code{OMP_TGT_MAPTYPE_TO} attribute in conjunction with the private attribute. The runtime is responsible for allocating the device memory associated with the variable, but it is not mapped as it is private. Second, \code{OMP_TGT_MAPTYPE_PRIVATE_VAL} instructs the runtime to simply forward the value of the \code{args_base} parameter to the target construct. This attribute is used to implement the \code{is_device_ptr} clause. This attribute can also be used by the compiler to forward small first-private values directly to the device via the kernel parameters. No allocation of memory occurs with this attribute.
The fields in the interface are used in three different ways depending on the presence of the \code{OMP_TGT_MAPTYPE_MAP_PTR} and \code{OMP_TGT_MAPTYPE_PRIVATE_VAL} attributes.
%
\begin{itemize}
\item Variables without the \code{OMP_TGT_MAPTYPE_MAP_PTR} and \code{OMP_TGT_MAPTYPE_PRIVATE_VAL} attributes. \code{args_base} contain the references to the variables that are being mapped. \code{args} points to the beginning of the data as requested by the user. For scalars, \code{args_base} and \code{args} are identical. For arrays, the user may request a subset of arrays, in which case \code{args} points to the beginning of these array subsets. Similarly, for struct, the user may request to map only a subset of the struct fields, in which case \code{args} points to the first fields of these structs. \code{args_size} contains the sizes of the variables, or the appropriate subsets of the variables, that are being mapped. \code{args_maptype} contains the attributes that control the data movement between host and device.
\item Variables with the \code{OMP_TGT_MAPTYPE_MAP_PTR} attribute and without the \code{OMP_TGT_MAPTYPE_PRIVATE_VAL} attribute. Arguments for mapped pointers describe both the pointer and pointee variables. \code{args_base} contains the references to the pointer variables to be mapped. If needed, the references to the pointee variables are found by dereferencing the \code{args_base} pointers. \code{args} contains the beginning of the data of the pointee variables. For arrays and structs, \code{args} points to the beginnings of the subsets of the pointee variables that are being mapped. \code{args_size} contains the sizes of the pointee variables, or the appropriate subset of the variables, that are being mapped. \code{args_maptype} contains the attributes that control the data movement of the pointee variable between host and device. By definition, the values of the mapped pointer variables on the device are never copied back to the host.
\item Variables with the \code{OMP_TGT_MAPTYPE_PRIVATE_VAL} attribute. The values to forward are found in the \code{args_base} arguments; \code{args} and \code{args_size} values are unused.
%
\end{itemize}
\fussy
Zero-length array references are identified by their \code{args_size} values of zero; and since they cannot have an offset, their \code{args} values are unused.
Table~\ref{tb:PointersMapping} shows the content of the arrays passed to \code{__tgt_target_data_begin()} and \code{__tgt_target} as result of the \dtargetdata{} and \dtarget{} pragmas in Example~\ref{ex:PointersMapping}.
Variable \code{i} is also passed to \code{__tgt_target} because it is captured in the body of the target region and are therefore arguments to the target region. Variable \code{pA} is also captured, and as captured pointers, it is mapped as a zero-length array. The \libomptarget{} implementation will detect that \code{s1} was mapped before and will not take any action to map this variable again. Variable \code{C} illustrates the use of \code{use_device_ptr} and \code{is_device_ptr}.
The arguments that are used to invoke the target kernel (see void \code{*target_vars_ptr} in Section~\ref{sc:TargetRTLInterface}) consist of the mapped base address of all elements. In the example above the arguments would be \code{&di, &dpA, &dA[0], &ds1, &dC}, where \code{dX} is the map of {X} in the device.
\begin{example}
\lstset{basicstyle=\scriptsize,frame=single}
\begin{lstlisting}
// N, M and S are constants
foo(){
int A[N], D[N], *pA, i;
struct S1 {int x, y, B[M], *pB, u, v; } s1;
int C[N];
pA = (int*)malloc(N*sizeof(int));
s1.pb = (int*)malloc(M*sizeof(int));
#pragma omp target data map(pA[0:M], s1, C) use_device_ptr(C)
{
/* C now correspond to the device pointer associated with mapped variable C, and can be used in a Cuda call, for example */
#pragma omp target map(to: A[S:M], s1.B[0:M]) map(from:s1.pb[0:M]) is_device_ptr(C) firstPrivate(D[0:M])
{
for (i=S; i<M; ++i){
++A[i];
--pA[i];
s1.pb[i-S] = s1.B[i-S] + A[i]*pA[i] - C[i] * D[i];
}
}
}
}
\end{lstlisting}
\lstset{basicstyle=\small\bfseries,frame=none}
\caption{Example requiring mapping of pointer.}
\label{ex:PointersMapping}
\end{example}
\begin{table}
\centering
\begin{tabular}{c|c|c|p{3cm}|p{3.5cm}}
%
\code{args_base} &
\code{args} &
\code{args_size} &
\code{args_maptype} &
comment \\\hline
%
\multicolumn{4}{c}{\cellcolor{gray!20}\code{\#pragma omp target data - __tgt_target_data_begin()}} \\\hline
% map(pA[S:M])
\code{pA} &
\code{&pA[0]} &
\code{M*sizeof(int)} &
\code{TO | FROM | FIRST_REF} &
mapped array subset\\\hline
% map(s1)
\code{&s1} &
\code{&s1} &
\code{sizeof(S1)} &
\code{TO | FROM | FIRST_REF} &
mapped struct\\\hline
% map(C) use_device_ptr(C)
\code{&C} &
\code{&C} &
\code{N*sizeof(int)} &
\code{TO | FROM | FIRST_REF | RETURN_PTR} &
mapped array for which the compiler want the device address\\\hline
%
\multicolumn{4}{c}{\cellcolor{gray!20}\code{\#pragma omp target - __tgt_target()}} \\\hline
% map(to: pA[0:0])
\code{pA} &
\code{undefined} &
\code{0} &
\code{TO | FROM| FIRST_REF} &
zero-length first-private pointer to a mapped array\\\hline
% map(to: A[S:M])
\code{&A} &
\code{&A[S]} &
\code{M*sizeof(int)} &
\code{TO | FIRST_REF} &
mapped array\\\hline
% map(to s1.B[0:M] ... s1.pb)
\code{&s1} &
\code{&s1.B[0]} &
\code{M*sizeof(int)+sizeof(void *)} &
\code{FIRST_REF} &
allocation of mapped struct subset\\\hline
% map(to s1.B[0:M])
\code{&s1} &
\code{&s1.B[0]} &
\code{M*sizeof(int)} &
\code{TO} &
memory move for subset of mapped struct\\\hline
% map(from s1.pB[0:M])
\code{&s1.pB} &
\code{&s1.pB[0]} &
\code{M*sizeof(int)} &
\code{FROM | MAP_PTR} &
mapped pointer and pointee array subset\\\hline
% is_device_ptr(C)
\code{C} &
\code{undefined} &
\code{undefined} &
\code{PRIVATE_VAL} &
mapped pointer and pointee array subset\\\hline
% firstprivate(i)
\code{i} &
\code{undefined} &
\code{undefined} &
\code{PRIVATE_VAL} &
first-private scalar passed as value\\\hline% firstprivate(D)
\code{&D} &
\code{&D[0]} &
\code{M*sizeof(int)} &
\code{PRIVATE_PTR | TO} &
first-private array\\\hline
%
\end{tabular}
\caption{Contents of the arrays passed through the interface for Example~\ref{ex:PointersMapping}. \code{OMP_TGT_MAPTYPE_} prefixes are omitted for conciseness. }
\label{tb:PointersMapping}
\end{table}
%%%%%%%%%%%%
%%%%%%%%%%%%
%%%%%%%%%%%%
\section{The libomptarget.so asynchronous interface} \label{sc:LibomptargetAsync}
This section describes the \libomptarget{} interface dedicated to handle asynchronous target task constructs. We first introduce an abstract data type to hide the implementation details of the RTL-specific synchronization events to the RTL interface.
\begin{itemize}
\item \code{typedef void * tgt_event_t;}
\end{itemize}
We also introduce the signature of a callback function that will be used by \libomptarget{} to indicate to the native runtime that a given asynchronous target task has fully completed.
\begin{itemize}
\item \code{typedef void * tgt_callback_param_t;}
\item \code{typedef void (*tgt_callback_fct)(tgt_callback_param_t param);}
\end{itemize}
If a native runtime elects to offload the enforcement of target-task-to-target-task dependences to a device, it can use the following functions to enable device enforced dependences.
\begin{itemize}
\item \code{void __tgt_init_dep(tgt_callback_fct completion_callback)} \\
This call indicates to the offloading infrastructure the callback function that must be called once an async target task has completed. This initialization call must be performed once, prior to processing any asynchronous operations.
\item \code{int32_t __tgt_enforce_dep(int32_t from_device_id, int32_t to_device_id, tgt_event_t completion_event)} \\
This call returns an integer that identifies if a dependence from a first target task (with device ID \code{from_device_id} and associated with a completion synchronization event \code{completion_event}) to a second target task (scheduled to execute on device with device ID \code{to_device_id}) can be satisfied by the devices. If the call returns a zero value, this dependence must be enforced on the host. If the call returns a nonzero value, this dependence must now be enforced by the devices.
\item \code{void __tgt_release_dep(tgt_sync_event_t completion_event)} \\
This call indicates to the offloading libraries that the native runtime will issue no further requests to enforce dependences tagged with the synchronization event \code{completion_event}. This call enables the offloading libraries to manage the live ranges of the data structures associated with synchronization events.
\item \code{int32_t __tgt_process_dep()} \\
This call enables the offloading libraries to process dependences. The call returns zero if no further dependence processing is required; nonzero values indicates that a further call may be required in the future. The native runtime must call this function while a host task is waiting for some target-task dependences to be resolved by the devices. This function enables the offloading libraries to perform polling on completed synchronization events that requires host-side processing, for example to cleanup the host side dependence data structures.
\end{itemize}
The following calls are asynchronous versions of \libomptarget{} calls seen in Section~\ref{sc:LibomptargetSync}.
\begin{itemize}
\item \code{void __tgt_target_async_data_begin(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event, tgt_callback_param_t param)}
\item \code{void __tgt_target_async_data_end(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event, tgt_callback_param_t param)}
\item \code{void __tgt_target_async_data_update(int32_t device_id, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event, tgt_callback_param_t param)}
\item \code{int32_t __tgt_target_async(int32_t device_id, void *host_addr, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t *args_maptype, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event, tgt_callback_param_t param)}
\item \code{int32_t __tgt_target_async_teams(int32_t device_id, void *host_addr, int32_t num_args, void** args_base, void** args, int64_t *args_size, int32_t *args_maptype, int32_t num_teams, int32_t thread_limit, int32_t *args_maptype, int32_t num_dep_events, tgt_event_t *dep_events, tgt_event_t *completion_event, tgt_callback_param_t param)}
\end{itemize}
Operations for the above calls will only occur after the \code{num_dep_events} synchronization events listed in \code{dep_events} are satisfied. The call returns in \code{completion_event} the synchronization event indicating the completion of the operations. Upon completion, the offloading interface will callback the runtime using the given callback parameter \code{param} provided in the above calls.