# nvptx-none-run: CU_LIMIT_STACK_SIZE #8

Open · tschwinge opened this issue Feb 12, 2015 · 2 comments

**@tschwinge** (Member) commented:

For nvptx-none toolchain testing, we're using `nvptx-none-run` to launch kernels on a 1 x 1 x 1 grid with 1 x 1 x 1 threads. We'd like to use `cuCtxSetLimit(CU_LIMIT_STACK_SIZE)` to increase the per-thread stack size from its tiny default value (1 KiB?).

Even though a subsequent `cuCtxGetLimit(CU_LIMIT_STACK_SIZE)` does acknowledge the value set, if it is set "too high", inscrutable errors (`CUDA_ERROR_ILLEGAL_ADDRESS`) may result from later `cuModuleLoadData` (?!) or `cuLaunchKernel` calls.

It is unclear how to safely maximize the per-thread stack size.
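
For reference, a minimal sketch of the call sequence in question; this is illustrative only, not the actual `nvptx-none-run` code (the driver-API calls are real, the function name and scaffolding are made up):

```c
/* Minimal sketch of the problematic sequence, for illustration only;
   not the actual nvptx-none-run code.  Error handling abbreviated.  */
#include <cuda.h>
#include <stdio.h>

static void
bump_stack_size (size_t requested)
{
  size_t actual;

  /* The driver accepts the new limit...  */
  if (cuCtxSetLimit (CU_LIMIT_STACK_SIZE, requested) != CUDA_SUCCESS)
    fprintf (stderr, "could not set stack limit\n");

  /* ...and cuCtxGetLimit duly reports the requested value back...  */
  if (cuCtxGetLimit (&actual, CU_LIMIT_STACK_SIZE) == CUDA_SUCCESS)
    printf ("per-thread stack size now: %zu bytes\n", actual);

  /* ...yet if 'requested' was "too high", a later cuModuleLoadData or
     cuLaunchKernel call may still fail with
     CUDA_ERROR_ILLEGAL_ADDRESS.  */
}
```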

tschwinge added a commit that referenced this issue Feb 13, 2015
... to work around <#8>.

According to Table 12, Technical Specifications per Compute Capability, on
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications>,
there is 512 KiB of local memory per thread, so a stack with 256 KiB seemed
workable (and indeed is, with Nvidia Quadro K1000M hardware, driver version
340.24, CUDA 5.5 installation), but not on a system with Nvidia Tesla K20c
hardware, driver version 319.37, CUDA 5.5 installation.

On the Nvidia Quadro K1000M system, no changes in GCC testsuite results.
amonakov added a commit to amonakov/nvptx-tools that referenced this issue Jan 29, 2016
It appears that memory accounting in CUDA drivers can be very conservative,
sometimes applying stack reservation for the maximum amount of threads the
device can host.  If so, even the current limit of 128 KiB is too much for
some Maxwell devices.  Instead of hardcoding some limit, calculate it at
run time based on available memory and maximum thread count.  Allow the user
to override this automatic guess on the command line.
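
In outline, such a run-time calculation might look as follows; this is a sketch of the approach described above, not the actual patch (in particular, the heap-size reservation and the final cap are omitted here):

```c
/* Sketch of the run-time heuristic described above, not the actual
   patch: assume the driver reserves stack space for every thread the
   device can host simultaneously, and size the per-thread stack
   accordingly.  */
static size_t
guess_stack_size (CUdevice dev)
{
  size_t mem;
  int sms, threads_per_sm;

  cuDeviceTotalMem (&mem, dev);
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
                        CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
                        dev);

  return mem / ((size_t) sms * (size_t) threads_per_sm);
}
```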
@amonakov mentioned this issue Jan 29, 2016

**@amonakov** (Contributor) commented:

Your notes and my experience suggest that the failures are due to the driver doing "worst-case" stack size accounting: per-thread stack size multiplied by the number of threads the device can host simultaneously. Pull request #10 implements a variant of safe-ish maximization based on that.
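
If that model is accurate, the arithmetic is unforgiving: a device with, say, 4 SMs hosting 2048 threads each would need 4 × 2048 × 128 KiB = 1 GiB of device memory just to back a 128 KiB per-thread stack.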

**@tschwinge** (Member, Author) commented:

Just a few years later, I've finally collected enough evidence that this needs to be re-visited. ;-)

For example, on a laptop where the Nvidia Quadro P1000 is also used for graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently `3127MiB / 4096MiB` memory are in use.
Any `nvptx-none-run` fails: `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.

Having a look: via `cuDeviceTotalMem`, it calculates the default stack size based on the total device memory (minus heap size, minus "128 MiB extra", divided by 4 SMs, 2048 threads each) to ~460 KiB, then "limit default size to 128 KiB maximum".
That, however, is too much, because that computation actually should be done not with the *total* but with the *free* device memory.
For example, see <https://stackoverflow.com/a/72382150/664214>: "calculation which must be satisified".
(This is racy, obviously; see also #27 "[nvptx-run] Add --verbose/-v", which we shall then also re-visit...)
Manually doing the calculation, I get a limit in the order of ~90 KiB.
And indeed: a launch with `--stack-size 100000` works, but `--stack-size 110000` fails.
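
(For illustration, plugging in the numbers from above: roughly 4096 − 3127 = 969 MiB free, minus the "128 MiB extra" leaves ~841 MiB; divided by 4 × 2048 = 8192 threads, that's ~105 KiB before the heap reservation is subtracted, in the same ballpark as the observed ~90 KiB.)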
I think we should do:

```diff
-r = CUDA_CALL_NOCHECK (cuDeviceTotalMem, &mem, dev);
+r = CUDA_CALL_NOCHECK (cuMemGetInfo, /* free */ &mem, /* total */ NULL);
```

I can confirm that this calculates an (even lower) stack size of ~60 KiB.
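
Spelled out, the free-memory variant might look roughly like this; an illustrative sketch with made-up names, using the constants quoted above, and passing both out-pointers to `cuMemGetInfo` for good measure:

```c
/* Illustrative sketch of a free-memory-based default stack size, not
   the actual nvptx-none-run code; 'heap_size' and the constants
   mirror the description above.  */
static size_t
default_stack_size (CUdevice dev, size_t heap_size)
{
  size_t free_mem, total_mem, reserved, stack;
  int sms, threads_per_sm;

  /* Inherently racy: other users of the device may allocate or free
     memory between this query and the actual kernel launch.  */
  cuMemGetInfo (&free_mem, &total_mem);
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
                        CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
                        dev);

  reserved = heap_size + 128 * 1024 * 1024;  /* "128 MiB extra".  */
  if (free_mem <= reserved)
    return 0;  /* Nothing left; let the caller fail cleanly.  */

  stack = (free_mem - reserved) / ((size_t) sms * (size_t) threads_per_sm);

  /* "limit default size to 128 KiB maximum".  */
  if (stack > 128 * 1024)
    stack = 128 * 1024;
  return stack;
}
```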

... which brings us to the next question: how low should we go? Should there be a minimum (default) stack size that we enforce, so that we reliably get `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)` instead of an inscrutable `an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS, 700)`?
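
For instance, one conceivable shape for such a floor (hypothetical; the threshold is a placeholder, not a vetted value):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical floor on the computed stack size; the threshold is a
   placeholder, not a vetted value.  Failing early here would turn a
   later inscrutable CUDA_ERROR_ILLEGAL_ADDRESS into an up-front,
   understandable error.  */
#define MIN_STACK_SIZE ((size_t) 32 * 1024)

static size_t
enforce_min_stack_size (size_t stack)
{
  if (stack < MIN_STACK_SIZE)
    {
      fprintf (stderr, "could not set stack limit: out of memory "
               "(computed stack size %zu below minimum %zu)\n",
               stack, MIN_STACK_SIZE);
      exit (1);
    }
  return stack;
}
```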

@amonakov, @vries: any thoughts?

@tschwinge reopened this Oct 11, 2022
tschwinge added a commit that referenced this issue Jun 23, 2023
…emory [#8]

<#8 (comment)>:

> [...], on a laptop where the Nvidia Quadro P1000 is also used for
> graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently
> `3127MiB / 4096MiB` memory are in use.
> Any `nvptx-none-run` fails:
> `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.
>
> Having a look, via `cuDeviceTotalMem` it calculates the default stack
> size based on the total device memory (minus heap size, minus
> "128 MiB extra", divided by 4 SMs, 2048 threads each) to ~460 KiB, then
> "limit default size to 128 KiB maximum".
> That, however, is too much, because that computation actually should be
> done not with the *total* but with the *free* device memory.
> For example, see <https://stackoverflow.com/a/72382150/664214>:
> "calculation which must be satisified".
> (This is racy, obviously; [...].)
> Manually doing the calculation, I get a limit in the order of ~90 KiB.
> And indeed: a launch with `--stack-size 100000` works, but
> `--stack-size 110000` fails.
> [Via `cuMemGetInfo`, `mem_free`] I can confirm that this calculates
> an (even lower) stack size of ~60 KiB.