nvptx-none-run: CU_LIMIT_STACK_SIZE #8
Comments
... to work around <#8>. According to Table 12, Technical Specifications per Compute Capability, at <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications>, there is 512 KiB of local memory per thread, so a stack of 256 KiB seemed workable (and indeed is, with Nvidia Quadro K1000M hardware, driver version 340.24, CUDA 5.5 installation), but not on a system with Nvidia Tesla K20c hardware, driver version 319.37, CUDA 5.5 installation. On the Nvidia Quadro K1000M system, there are no changes in GCC testsuite results.
It appears that memory accounting in CUDA drivers can be very conservative, sometimes applying the stack reservation for the maximum number of threads the device can host. If so, even the current limit of 128 KiB is too much for some Maxwell devices. Instead of hardcoding some limit, calculate it at run time based on available memory and maximum thread count. Allow the user to override this automatic guess on the command line.
Your notes and my experience suggest that failures have to do with the driver doing "worst-case" stack size accounting: per-thread stack size multiplied by the number of threads the device can host simultaneously. Pull request #10 implements a variant of safe-ish maximization based on that.
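A minimal sketch of the kind of run-time calculation these comments describe (not the actual PR #10 code): under the worst-case-accounting assumption, the driver's reservation is `stack_size * num_SMs * max_threads_per_SM`, so invert that to derive a stack size from device capacity. The 128 MiB slack value is illustrative, and error checking is omitted.

```c
#include <cuda.h>

/* Sketch: derive a per-thread stack limit, assuming the driver reserves
   stack for every thread the device can host simultaneously:
   reservation = stack_size * num_SMs * max_threads_per_SM.  */
static size_t
guess_stack_limit (CUdevice dev)
{
  int sms, threads_per_sm;
  size_t total;
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
			CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
			dev);
  cuDeviceTotalMem (&total, dev);
  /* Leave slack for heap and driver overhead; 128 MiB is an illustrative
     value, not necessarily what nvptx-none-run uses.  */
  size_t slack = 128 * 1024 * 1024;
  size_t avail = total > slack ? total - slack : 0;
  return avail / ((size_t) sms * threads_per_sm);
}
```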
Just a few years later, I've finally collected enough evidence that this needs to be re-visited. ;-) For example, on a laptop where the Nvidia Quadro P1000 is also used for graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently `3127MiB / 4096MiB` memory are in use. Any `nvptx-none-run` fails: `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.

Having a look, via `cuMemGetInfo`'s `mem_free`, I'm confirming this to calculate an (even lower) stack size of ~60 KiB.

... which brings us to the next question: how low should we go? Should there be a minimum (default) stack size that we enforce, so that we reliably get […]
…emory [#8] <#8 (comment)>:

> [...], on a laptop where the Nvidia Quadro P1000 is also used for
> graphics (Xorg, GNOME Shell, Firefox), per `nvidia-smi`, currently
> `3127MiB / 4096MiB` memory are in use.
> Any `nvptx-none-run` fails:
> `could not set stack limit: out of memory (CUDA_ERROR_OUT_OF_MEMORY, 2)`.
>
> Having a look, via `cuDeviceTotalMem` it calculates the default stack
> size based on the total device memory (minus heap size, minus
> "128 MiB extra", divided by 4 SMs, 2048 threads each) to ~460 KiB, then
> "limit default size to 128 KiB maximum".
> That, however, is too much, because that computation actually should be
> done not with the *total* but with the *free* device memory.
> For example, see <https://stackoverflow.com/a/72382150/664214>:
> "calculation which must be satisfied".
> (This is racy, obviously; [...].)
> Manually doing the calculation, I get a limit in the order of ~90 KiB.
> And indeed: a launch with `--stack-size 100000` works, but
> `--stack-size 110000` fails.
> [Via `cuMemGetInfo`, `mem_free`] I'm confirming this to calculate an
> (even lower) stack size of ~60 KiB.
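A sketch of the free-memory-based variant of that calculation, per the quoted comment. It assumes an active CUDA context (required by `cuMemGetInfo`); the 8 MiB heap and 128 MiB slack values are illustrative placeholders, and error checking is omitted.

```c
#include <cuda.h>

/* Sketch: the same calculation, but based on *free* rather than *total*
   device memory.  Requires a current context for cuMemGetInfo.  */
static size_t
guess_stack_limit_free (CUdevice dev)
{
  int sms, threads_per_sm;
  size_t mem_free, mem_total;
  cuDeviceGetAttribute (&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
  cuDeviceGetAttribute (&threads_per_sm,
			CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
			dev);
  /* Racy, obviously: other processes allocate and free concurrently.  */
  cuMemGetInfo (&mem_free, &mem_total);
  /* Subtract heap size plus extra slack; both values are illustrative.  */
  size_t heap = 8 * 1024 * 1024, slack = 128 * 1024 * 1024;
  size_t avail = mem_free > heap + slack ? mem_free - heap - slack : 0;
  return avail / ((size_t) sms * threads_per_sm);
}
```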
For `nvptx-none` toolchain testing, we're using `nvptx-none-run` to launch kernels on a 1 x 1 x 1 grid with 1 x 1 x 1 threads. We'd like to use `cuCtxSetLimit(CU_LIMIT_STACK_SIZE)` to increase the per-thread stack size from its tiny default value (1 KiB?).

Even though a `cuCtxGetLimit(CU_LIMIT_STACK_SIZE)` does acknowledge the value set, if this is set "too high", inscrutable errors (`CUDA_ERROR_ILLEGAL_ADDRESS`) may result from later `cuModuleLoadData` (?!) or `cuLaunchKernel` calls.

It is unclear how to safely maximize the per-thread stack size.
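A minimal sketch of the failure mode described above: `cuCtxSetLimit` succeeds and `cuCtxGetLimit` reads the value back, but an excessive value only surfaces as an error in later calls. The 256 KiB value is the one discussed in this thread as "workable" on some systems only; error checking is abbreviated.

```c
#include <cuda.h>
#include <stdio.h>

int
main (void)
{
  CUdevice dev;
  CUcontext ctx;
  cuInit (0);
  cuDeviceGet (&dev, 0);
  cuCtxCreate (&ctx, 0, dev);

  /* 256 KiB: works on some hardware/driver combinations, not others.  */
  size_t requested = 256 * 1024;
  CUresult r = cuCtxSetLimit (CU_LIMIT_STACK_SIZE, requested);
  size_t actual;
  cuCtxGetLimit (&actual, CU_LIMIT_STACK_SIZE);
  /* The value is acknowledged here even if it is "too high"...  */
  printf ("set: %d, get: %zu\n", (int) r, actual);

  /* ...and a subsequent cuModuleLoadData or cuLaunchKernel may then fail
     with CUDA_ERROR_ILLEGAL_ADDRESS (or CUDA_ERROR_OUT_OF_MEMORY).  */

  cuCtxDestroy (ctx);
  return 0;
}
```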