TensorFlow code often spawns many threads, and the jail reports "too many" threads even though the actual number of threads is within the configured limit.
Potential solutions:
* Directly read "/proc/{pid}/status" to get the actual number of threads from the OS. This may incur some overhead when the child spawns new processes/threads.
* Guard the childCount variable with explicit locks.
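The first option could be sketched roughly as below. This is only an illustration, not the jail's actual code: `parseThreads` and `readThreadCount` are hypothetical names, and the sketch assumes a Linux procfs where "/proc/{pid}/status" contains a `Threads:` line.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the value of the "Threads:" line from the text
// of a /proc/{pid}/status file. (Hypothetical helper for illustration.)
func parseThreads(status string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Threads:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			return strconv.Atoi(value)
		}
	}
	return 0, fmt.Errorf("no Threads field found")
}

// readThreadCount reads the thread count of a single process from procfs.
// It fails if the process has already exited or procfs is unavailable.
func readThreadCount(pid int) (int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, err
	}
	return parseThreads(string(data))
}
```

Calling `readThreadCount(pid)` on every clone/fork event is where the overhead mentioned above would come from, since each call opens and parses a procfs file.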
Even so, TensorFlow seems to increase the number of threads as we repeatedly call regressors.
We need to find a good solution for this.
NOTE:
Even the following code produces far more threads than the number of CPU cores allocated to the container:
Adding locks did not change anything, as expected, because we increment/decrement childCount within a single goroutine that receives waitpid results via a channel.
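To illustrate why locks are redundant in that design, here is a minimal sketch of the pattern. The `childEvent` type and `countChildren` function are hypothetical stand-ins, not the jail's real types; the point is only that the counter is touched by exactly one goroutine.

```go
package main

// childEvent is a hypothetical message describing one waitpid result.
type childEvent struct {
	pid    int
	exited bool // false: a new child/thread appeared; true: it exited
}

// countChildren drains the event channel on the calling goroutine.
// childCount is a local variable owned by this goroutine alone, so no
// mutex is needed: the channel already serializes all updates.
func countChildren(events <-chan childEvent) (final, peak int) {
	childCount := 0
	for ev := range events {
		if ev.exited {
			childCount--
		} else {
			childCount++
		}
		if childCount > peak {
			peak = childCount
		}
	}
	return childCount, peak
}
```

Any number of producers may send on the channel concurrently; the count stays consistent because only the receiving loop reads or writes it.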
After writing a function that reads procfs to recursively get the number of threads of all children, I found that the original jail implementation is correct and that the numThreads value in "/proc/{pid}/status" contains only the direct child threads.
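A recursive procfs count of this kind might look like the sketch below. It is an assumption-laden illustration rather than the function actually written for the jail: all names (`procInfo`, `parseStatus`, `snapshot`, `countTree`) are hypothetical, and it finds children by scanning the `PPid:` field of every "/proc/{pid}/status" file on Linux.

```go
package main

import (
	"bufio"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// procInfo holds the two /proc/{pid}/status fields used in this sketch.
type procInfo struct {
	ppid    int
	threads int
}

// parseStatus extracts the PPid and Threads fields from status text.
func parseStatus(status string) procInfo {
	var info procInfo
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "PPid:":
			info.ppid, _ = strconv.Atoi(fields[1])
		case "Threads:":
			info.threads, _ = strconv.Atoi(fields[1])
		}
	}
	return info
}

// snapshot reads the status file of every process visible in /proc.
// Processes that exit during the scan are silently skipped.
func snapshot() map[int]procInfo {
	procs := make(map[int]procInfo)
	paths, _ := filepath.Glob("/proc/[0-9]*/status")
	for _, path := range paths {
		pid, err := strconv.Atoi(filepath.Base(filepath.Dir(path)))
		if err != nil {
			continue
		}
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		procs[pid] = parseStatus(string(data))
	}
	return procs
}

// countTree returns the number of processes and the total number of
// threads in the process subtree rooted at root (root included).
func countTree(procs map[int]procInfo, root int) (numProcs, numThreads int) {
	info, ok := procs[root]
	if !ok {
		return 0, 0
	}
	numProcs, numThreads = 1, info.threads
	for pid, p := range procs {
		if p.ppid == root && pid != root {
			cp, ct := countTree(procs, pid)
			numProcs += cp
			numThreads += ct
		}
	}
	return numProcs, numThreads
}
```

A caller would run `countTree(snapshot(), childPid)` and compare the totals against the waitpid-based childCount. The snapshot is not atomic, so a fast-forking child can make the two numbers differ momentarily.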
Then we need to find some way to further reduce the number of threads used by TensorFlow itself.
* Add a utility function that reads procfs recursively to count all
child processes and threads, but it gives the same result as
the original child counting mechanism via waitpid and ptrace.