nvptx-none-run: CU_LIMIT_STACK_SIZE #8
Comments
... to work around <#8>. According to Table 12, Technical Specifications per Compute Capability, at <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications>, there is 512 KiB of local memory per thread, so a stack of 256 KiB seemed workable (and indeed is, with Nvidia Quadro K1000M hardware, driver version 340.24, CUDA 5.5 installation), but not on a system with Nvidia Tesla K20c hardware, driver version 319.37, CUDA 5.5 installation. On the Nvidia Quadro K1000M system, there are no changes in GCC testsuite results.
It appears that memory accounting in CUDA drivers can be very conservative, sometimes applying the stack reservation for the maximum number of threads the device can host. If so, even the current limit of 128 KiB is too much for some Maxwell devices. Instead of hardcoding some limit, calculate it at run time based on available memory and maximum thread count. Allow the user to override this automatic guess on the command line.
Your notes and my experience suggest that failures have to do with the driver doing "worst-case" stack size accounting: per-thread stack size multiplied by the number of threads the device can host simultaneously. Pull request #10 implements a variant of safe-ish maximization based on that.
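A minimal sketch of the kind of run-time calculation these comments describe (not the actual PR #10 code): under the worst-case-accounting assumption, the driver's reservation is `stack_size * num_SMs * max_threads_per_SM`, so invert that to derive a stack size from device capacity. The 128 MiB slack value is illustrative, and error checking is omitted.

```c
#include <cuda.h>

/* Sketch: derive a per-thread stack limit, assuming the driver reserves
   stack for every thread the device can host simultaneously:
   reservation = stack_size * num_SMs * max_threads_per_SM.  */
static size_t
guess_stack_limit (CUdevice dev)
{
  int sms, threads_per_sm;
  size_t total;
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
			CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
			dev);
  cuDeviceTotalMem (&total, dev);
  /* Leave slack for heap and driver overhead; 128 MiB is an illustrative
     value, not necessarily what nvptx-none-run uses.  */
  size_t slack = 128 * 1024 * 1024;
  size_t avail = total > slack ? total - slack : 0;
  return avail / ((size_t) sms * threads_per_sm);
}
```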
Just a few years later, I've finally collected enough evidence that this needs to be re-visited. ;-) For example, on a laptop where the Nvidia Quadro P1000 is also used for graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently `3127MiB / 4096MiB` memory are in use. Any `nvptx-none-run` fails: `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.

Having a look, via `cuMemGetInfo`'s `mem_free`, I'm confirming this to calculate an (even lower) stack size of ~60 KiB.

... which brings us to the next question: how low should we go? Should there be a minimum (default) stack size that we enforce, so that we reliably get […]
…emory [#8] <#8 (comment)>:

> [...], on a laptop where the Nvidia Quadro P1000 is also used for
> graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently
> `3127MiB / 4096MiB` memory are in use.
> Any `nvptx-none-run` fails:
> `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.
>
> Having a look, via `cuDeviceTotalMem` it calculates the default stack
> size based on the total device memory (minus heap size, minus
> "128 MiB extra", divided by 4 SMs, 2048 threads each) to ~460 KiB, then
> "limit default size to 128 KiB maximum".
> That, however, is too much, because that computation actually should be
> done not with the *total* but with the *free* device memory.
> For example, see <https://stackoverflow.com/a/72382150/664214>:
> "calculation which must be satisfied".
> (This is racy, obviously; [...].)
> Manually doing the calculation, I get a limit in the order of ~90 KiB.
> And indeed: a launch with `--stack-size 100000` works, but
> `--stack-size 110000` fails.
> [Via `cuMemGetInfo`, `mem_free`] I'm confirming this to calculate an
> (even lower) stack size of ~60 KiB.
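A sketch of the free-memory-based variant of that calculation, per the quoted comment. It assumes an active CUDA context (required by `cuMemGetInfo`); the 8 MiB heap and 128 MiB slack values are illustrative placeholders, and error checking is omitted.

```c
#include <cuda.h>

/* Sketch: the same calculation, but based on *free* rather than *total*
   device memory.  Requires a current context for cuMemGetInfo.  */
static size_t
guess_stack_limit_free (CUdevice dev)
{
  int sms, threads_per_sm;
  size_t mem_free, mem_total;
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
			CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
			dev);
  /* Racy, obviously: other processes allocate and free concurrently.  */
  cuMemGetInfo (&mem_free, &mem_total);
  /* Subtract heap size plus extra slack; both values are illustrative.  */
  size_t heap = 8 * 1024 * 1024, slack = 128 * 1024 * 1024;
  size_t avail = mem_free > heap + slack ? mem_free - heap - slack : 0;
  return avail / ((size_t) sms * threads_per_sm);
}
```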
For `nvptx-none` toolchain testing, we're using `nvptx-none-run` to launch kernels on a 1 x 1 x 1 grid with 1 x 1 x 1 threads. We'd like to use `cuCtxSetLimit(CU_LIMIT_STACK_SIZE)` to increase the per-thread stack size from its tiny default value (1 KiB?).

Even though a `cuCtxGetLimit(CU_LIMIT_STACK_SIZE)` does acknowledge the value set, if this is set "too high", inscrutable errors (`CUDA_ERROR_ILLEGAL_ADDRESS`) may result from later `cuModuleLoadData` (?!) or `cuLaunchKernel` calls.

It is unclear how to safely maximize the per-thread stack size.
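A minimal sketch of the failure mode described above: `cuCtxSetLimit` succeeds and `cuCtxGetLimit` reads the value back, but an excessive value only surfaces as an error in later calls. The 256 KiB value is the one discussed in this thread as "workable" on some systems only; error checking is abbreviated.

```c
#include <cuda.h>
#include <stdio.h>

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;
  cuInit (0);
  cuDeviceGet (&dev, 0);
  cuCtxCreate (&ctx, 0, dev);

  /* 256 KiB: works on some hardware/driver combinations, not others.  */
  size_t requested = 256 * 1024;
  CUresult r = cuCtxSetLimit (CU_LIMIT_STACK_SIZE, requested);
  size_t actual;
  cuCtxGetLimit (&actual, CU_LIMIT_STACK_SIZE);
  /* The value is acknowledged here even if it is "too high"...  */
  printf ("set: %d, get: %zu\n", (int) r, actual);

  /* ...and a subsequent cuModuleLoadData or cuLaunchKernel may then fail
     with CUDA_ERROR_ILLEGAL_ADDRESS (or CUDA_ERROR_OUT_OF_MEMORY).  */

  cuCtxDestroy (ctx);
  return 0;
}
```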