loss function (In Policy Gradient section), optimizer and entropy #9
Comments
Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer differentiates it (which corresponds to the gradient operator in the policy-gradient equation) and applies a gradient step. Hope this helps!
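To make that concrete, here is a minimal, hedged sketch of that policy-gradient objective in Theano. The variable names mirror the snippet in the question; the shapes and the sign convention (maximize the objective vs. minimize its negation) are assumptions, not a verbatim copy of the repo's code.

```python
import theano
import theano.tensor as T

prob_act = T.matrix('prob_act')   # (N, num_actions): policy output probabilities
actions = T.ivector('actions')    # (N,): indices of the actions actually taken
values = T.vector('values')       # (N,): advantages, i.e. return minus baseline

N = actions.shape[0]
# log pi(a_t | s_t) for each taken action, weighted by its advantage and
# averaged over the batch -- the REINFORCE objective.
log_pi = T.log(prob_act[T.arange(N), actions])
objective = log_pi.dot(values) / T.cast(N, theano.config.floatX)
# T.grad(objective, params) then yields the policy gradient that the
# optimizer turns into a parameter update.
```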
Why did you use RMSprop? What are the problems with using the Adam optimizer, and what are the differences between the two?
Here's how we computed the advantage (the return Gt minus a time-based baseline): https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202. IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see "Optimizations" in Section 4). The last comment was about the different episode termination criteria. They have their literal meaning: 'no_new_jobs' ends the episode once no new jobs are arriving, while 'all_done' only terminates the episode when all jobs (including those still unfinished when 'no_new_jobs' is satisfied) are completed: Lines 255 to 265 in b42eff0
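For readers without the repo handy, here is a hedged NumPy sketch of the time-based-baseline idea from the pg_re.py link above (not a verbatim copy of that code): the baseline at time t is the return averaged over trajectories that are still alive at t, and the advantage is the per-trajectory return minus that baseline.

```python
import numpy as np

def discounted_returns(rewards, gamma=1.0):
    """G_t = r_t + gamma * G_{t+1} for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def time_based_advantages(all_rewards, gamma=1.0):
    """all_rewards: list of 1-D reward arrays, one per trajectory."""
    all_returns = [discounted_returns(r, gamma) for r in all_rewards]
    max_len = max(len(g) for g in all_returns)
    # Pad with zeros so trajectories of different lengths can be averaged,
    # and count how many trajectories are still alive at each timestep.
    padded = np.zeros((len(all_returns), max_len))
    alive = np.zeros((len(all_returns), max_len))
    for i, g in enumerate(all_returns):
        padded[i, :len(g)] = g
        alive[i, :len(g)] = 1.0
    baseline = padded.sum(axis=0) / np.maximum(alive.sum(axis=0), 1.0)
    # Advantage = return minus the per-timestep (time-based) baseline.
    return [g - baseline[:len(g)] for g in all_returns]
```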
Thanks for your explanations. I wonder why you implemented the optimizers yourself; did you have any special goal? (They're already predefined and implemented.) Does the Theano library provide these common optimizers or not?
We didn't reimplement the optimizers; Theano (as well as the more commonly used TensorFlow or PyTorch) has built-in optimizers like RMSProp and Adam. The entropy term is for promoting exploration at the beginning of RL training.
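To make the entropy point concrete, here is a hedged sketch of how an entropy bonus is typically added to a policy-gradient objective so that the action distribution stays spread out early in training. The coefficient and variable names are assumptions, not values taken from the repo.

```python
import theano.tensor as T

prob_act = T.matrix('prob_act')   # (N, num_actions): policy probabilities
entropy_coeff = 0.01              # assumed hyperparameter, not from the repo

# Mean per-step entropy of the policy; the small constant avoids log(0).
entropy = -T.sum(prob_act * T.log(prob_act + 1e-8), axis=1).mean()
# objective_with_entropy = pg_objective + entropy_coeff * entropy
```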
Thank you for your attention, but I must clarify that I was talking about rmsprop_updates, and I would appreciate it if you changed stepsize (the 3rd input of the function below) to lr_rate. (It confused me :))
The special parameter in your rmsprop_updates function is grads, which holds the gradients of the loss with respect to params.
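For reference, here is a hedged sketch of what an RMSProp update of this kind typically looks like in Theano; the function name, default values, and descent sign convention are assumptions rather than a copy of the repo's rmsprop_updates.

```python
import numpy as np
import theano
import theano.tensor as T

def rmsprop_updates_sketch(grads, params, lr_rate, rho=0.9, epsilon=1e-6):
    """Keep a running average of squared gradients per parameter and scale
    each step by the inverse root of that average."""
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(np.zeros(p.get_value().shape,
                                     dtype=theano.config.floatX))
        new_acc = rho * acc + (1.0 - rho) * g ** 2       # running mean of g^2
        step = lr_rate * g / T.sqrt(new_acc + epsilon)   # scaled gradient step
        updates.append((acc, new_acc))
        updates.append((p, p - step))                    # descent on the loss
    return updates
```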
You are right that …
Sir, your answer about entropy is not convincing to me. You have a method for it, but you did not use it in your network or in the REINFORCE training (single or multiple).
Dear Mr. hongzi,
I was interested in your resource scheduling method. Now I am stuck in your network class. I can't understand why you used the function below:
loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
Did you design a special loss function yourself? If not, what's the name of this loss function?
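As an illustration of what that expression computes (with made-up numbers, not values from the repo), the indexing prob_act[T.arange(N), actions] picks, for each timestep, the probability of the action that was actually taken:

```python
import numpy as np

prob_act = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])   # policy output for N = 2 steps
actions = np.array([0, 1])               # actions taken at each step
values = np.array([1.5, -0.5])           # advantages for each step
N = len(actions)

chosen = prob_act[np.arange(N), actions]   # -> [0.7, 0.8]
loss = np.log(chosen).dot(values) / N      # advantage-weighted log-likelihood
```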