
loss function (In Policy Gradient section), optimizer and entropy #9

Open
ahmadreza9 opened this issue May 31, 2020 · 12 comments

@ahmadreza9

ahmadreza9 commented May 31, 2020

Dear Mr. Hongzi,
I am interested in your resource scheduling method, but right now I am stuck in your network class. I can't understand why you used the function below:
loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
Did you derive a special loss function? If not, what is the name of this loss function?

@ahmadreza9 ahmadreza9 changed the title Policy Gradient loss function loss function (In Policy Gradient section) May 31, 2020
@ahmadreza9
Author

I think that is related to Monte Carlo.
[image: screenshot of the policy-gradient equation]

@hongzimao
Owner

hongzimao commented Jun 2, 2020

Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer will differentiate the loss (which corresponds to the gradient operator in the equation) and apply a gradient step. Hope this helps!
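
For reference, a minimal sketch of how such a REINFORCE loss can be expressed in Theano, reusing the names prob_act, actions and values from the quoted line (this is an illustration only, not the repository's exact code; the sign convention depends on whether the surrounding code minimizes the loss or ascends the objective):

import theano.tensor as T

# symbolic inputs standing in for the network's outputs and the sampled data
prob_act = T.matrix('prob_act')   # (N, num_actions) action probabilities from the policy
actions = T.ivector('actions')    # (N,) actions actually taken
values = T.vector('values')       # (N,) return minus baseline for each step

N = actions.shape[0]
log_pi = T.log(prob_act[T.arange(N), actions])  # log pi(a_t | s_t) of the taken actions
loss = log_pi.dot(values) / N                   # batch-averaged REINFORCE objective
# differentiating loss w.r.t. the network parameters (T.grad) yields the
# REINFORCE gradient: grad log pi(s, a) * (value - baseline)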

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

So, as I can see, you calculate this function:
[image: screenshot of the policy-gradient update equation]

Could you tell me how you computed Gt = vt - bt? (I can't see where you calculate bt, so it must be hidden in your loss function.)
I only found this comment in the parameters class: self.num_seq_per_batch = 10  # number of sequences to compute baseline

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

Why did you use RMSProp? What are the problems with using the Adam optimizer? And what is the difference between the two settings of the end variable (# termination type), 'no_new_job' and 'all_done'?

@hongzimao
Owner

hongzimao commented Jun 3, 2020

Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
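
In rough outline, a time-based baseline of the kind linked above can be computed like this (a sketch along the lines of that code, not a verbatim copy; the toy all_returns list stands in for the per-trajectory discounted returns gathered during sampling):

import numpy as np

# toy per-trajectory discounted returns (trajectories can have different lengths);
# in the real code these come from the sampled episodes
all_returns = [np.array([5.0, 3.0, 1.0]), np.array([4.0, 2.0])]

max_len = max(len(ret) for ret in all_returns)

# pad shorter trajectories with zeros and average across trajectories,
# giving one baseline value b_t per timestep
padded = np.array([np.concatenate([ret, np.zeros(max_len - len(ret))])
                   for ret in all_returns])
baseline = padded.mean(axis=0)

# advantage of each trajectory: its own return minus the shared time-based baseline
all_advantages = [ret - baseline[:len(ret)] for ret in all_returns]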

IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see "Optimizations" in Section 4).

The last question is about the different episode termination criteria. They mean what they say, I think: 'no_new_job' ends the episode when no new jobs are arriving, and 'all_done' only terminates the episode when all jobs (including the ones still unfinished when 'no_new_job' is satisfied) are completed:

deeprm/environment.py

Lines 255 to 265 in b42eff0

if self.end == "no_new_job":  # end of new job sequence
    if self.seq_idx >= self.pa.simu_len:
        done = True

elif self.end == "all_done":  # everything has to be finished
    if self.seq_idx >= self.pa.simu_len and \
       len(self.machine.running_job) == 0 and \
       all(s is None for s in self.job_slot.slot) and \
       all(s is None for s in self.job_backlog.backlog):
        done = True

elif self.curr_time > self.pa.episode_max_length:  # run too long, force termination
    done = True

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

Thanks for your explanations. I wonder why you implemented the optimizers yourself. Did you have a special goal? (They are already predefined and implemented.) Does the Theano library provide these common optimizers or not?

@ahmadreza9
Author

ahmadreza9 commented Jun 5, 2020

  1. You defined entropy and report its mean. Why didn't you use this metric in the evaluation part of your article?
  2. I see that you use the variable mem_alloc = 4 in pg_su.py. Why did you assign this constant?

@ahmadreza9 ahmadreza9 changed the title loss function (In Policy Gradient section) loss function (In Policy Gradient section), optimizer and entropy Jun 5, 2020
@hongzimao
Owner

hongzimao commented Jun 6, 2020

We didn't reimplement the optimizers; Theano (and the more commonly used TensorFlow and PyTorch) all have built-in optimizers like RMSProp and Adam.

The entropy term promotes exploration at the beginning of RL training.
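
As an illustration, an entropy bonus is usually folded into the policy-gradient loss roughly like this (a sketch only; entropy_weight is a hypothetical coefficient, and prob_act / loss are the symbolic quantities sketched earlier in the thread):

import theano.tensor as T

# entropy of the action distribution at each step, averaged over the batch;
# the small constant avoids log(0) for near-deterministic policies
entropy = -T.sum(prob_act * T.log(prob_act + 1e-8), axis=1).mean()

# subtracting a weighted entropy term rewards more uniform (exploratory) policies
# early in training; the weight is typically annealed toward zero
entropy_weight = 0.01  # hypothetical value
total_loss = loss - entropy_weight * entropy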

pg_su is for supervised learning. If I remember correctly, mem_alloc is just a parameter controlling the size of the generated dataset.

@ahmadreza9
Author

ahmadreza9 commented Jun 6, 2020

Thank you for your attention, but I should clarify that I was talking about rmsprop_updates. I would also appreciate it if you renamed stepsize (the third input of the function below) to lr_rate. (It confused me :))
def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):
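
For context, a hand-written RMSProp update with that signature typically looks something like the following (a sketch of the standard RMSProp rule, not necessarily the repository's exact code):

import numpy as np
import theano
import theano.tensor as T

def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):
    """Return (shared_variable, new_value) pairs implementing one RMSProp step."""
    updates = []
    for param, grad in zip(params, grads):
        # running average of squared gradients, one accumulator per parameter
        acc = theano.shared(np.zeros(param.get_value().shape,
                                     dtype=theano.config.floatX))
        acc_new = rho * acc + (1.0 - rho) * grad ** 2
        updates.append((acc, acc_new))
        # scale the learning rate by the RMS of recent gradients
        updates.append((param, param - stepsize * grad / T.sqrt(acc_new + epsilon)))
    return updates

Renaming the stepsize argument to lr_rate, as suggested, would only change the parameter name, not the update rule.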

@ahmadreza9 ahmadreza9 reopened this Jun 6, 2020
@ahmadreza9
Author

ahmadreza9 commented Jun 6, 2020

The special parameter in your rmsprop_updates function is grads, which holds the gradients of the loss with respect to params: grads = T.grad(loss, params). I found torch.autograd.grad(gg, xx) for PyTorch and tf.gradients(ys, xs) for TensorFlow. Are these equivalent to your grads? (If you have worked with PyTorch and TensorFlow.)

@hongzimao
Owner

You are right that rmsprop_updates is a customized function. I guess back then, standardized library implementations of those optimizers were not available :) Things are easier nowadays. And you are right about the gradient operations in TensorFlow and PyTorch.
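
For instance, the PyTorch counterpart of grads = T.grad(loss, params) can be sketched as follows (a toy example; the tensor w stands in for the policy network's parameters):

import torch

# toy parameter and loss, standing in for the policy network's params and loss
w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

# equivalent of Theano's grads = T.grad(loss, params):
grads = torch.autograd.grad(loss, [w])  # one gradient tensor per parameter
print(grads)

In TensorFlow 1.x graph mode, tf.gradients(loss, params) plays the same role.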

@ahmadreza9
Author

ahmadreza9 commented Dec 5, 2020

The entropy term promotes exploration at the beginning of RL training.

Sir, your answer about entropy does not convince me. You have a method for it, but you did not use it in your network or in the REINFORCE training (single or multiple).

@ahmadreza9 ahmadreza9 reopened this Dec 5, 2020