
loss function (In Policy Gradient section), optimizer and entropy #9

Open
ahmadreza9 opened this issue May 31, 2020 · 12 comments

@ahmadreza9

ahmadreza9 commented May 31, 2020

Dear Mr. Hongzi,
I am interested in your resource scheduling method, but right now I am stuck in your network class. I can't understand why you used the function below:
loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
Did you derive a special loss function? If not, what is the name of this loss function?

@ahmadreza9 ahmadreza9 changed the title Policy Gradient loss function loss function (In Policy Gradient section) May 31, 2020
@ahmadreza9
Author

I think that is related to Monte Carlo.
[image: screenshot of the policy-gradient equation]

@hongzimao
Owner

hongzimao commented Jun 2, 2020

Yes, the code implements the REINFORCE algorithm. Notice that the loss is log_pi(s, a) * (value - baseline). When minimizing the loss, the underlying optimizer will differentiate the loss (which corresponds to the gradient operator in the equation) and apply a gradient step. Hope this helps!
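
For reference, a minimal sketch of how such a REINFORCE loss can be expressed in Theano, reusing the names prob_act, actions and values from the quoted line (this is an illustration only, not the repository's exact code; the sign convention depends on whether the surrounding code minimizes the loss or ascends the objective):

import theano.tensor as T

# symbolic inputs standing in for the network's outputs and the sampled data
prob_act = T.matrix('prob_act')   # (N, num_actions) action probabilities from the policy
actions = T.ivector('actions')    # (N,) actions actually taken
values = T.vector('values')       # (N,) return minus baseline for each step

N = actions.shape[0]
log_pi = T.log(prob_act[T.arange(N), actions])  # log pi(a_t | s_t) of the taken actions
loss = log_pi.dot(values) / N                   # batch-averaged REINFORCE objective
# differentiating loss w.r.t. the network parameters (T.grad) yields the
# REINFORCE gradient: grad log pi(s, a) * (value - baseline)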

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

So, as I can see, you calculate this function:
[image: screenshot of the policy-gradient update equation]

Could you tell me how you computed Gt = vt - bt? (I can't see where you calculate bt, so it must be hidden in your loss function.)
I only found this comment in the parameters class: self.num_seq_per_batch = 10  # number of sequences to compute baseline

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

Why did you use RMSProp? What are the problems with using the Adam optimizer? And what is the difference between the two settings of the end variable (# termination type), 'no_new_job' and 'all_done'?

@hongzimao
Owner

hongzimao commented Jun 3, 2020

Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
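
In rough outline, a time-based baseline of the kind linked above can be computed like this (a sketch along the lines of that code, not a verbatim copy; the toy all_returns list stands in for the per-trajectory discounted returns gathered during sampling):

import numpy as np

# toy per-trajectory discounted returns (trajectories can have different lengths);
# in the real code these come from the sampled episodes
all_returns = [np.array([5.0, 3.0, 1.0]), np.array([4.0, 2.0])]

max_len = max(len(ret) for ret in all_returns)

# pad shorter trajectories with zeros and average across trajectories,
# giving one baseline value b_t per timestep
padded = np.array([np.concatenate([ret, np.zeros(max_len - len(ret))])
                   for ret in all_returns])
baseline = padded.mean(axis=0)

# advantage of each trajectory: its own return minus the shared time-based baseline
all_advantages = [ret - baseline[:len(ret)] for ret in all_returns]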

IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see "Optimizations" in Section 4).

The last question is about the different episode termination criteria. They mean what they say, I think: 'no_new_job' ends the episode when no new jobs are arriving, and 'all_done' only terminates the episode when all jobs (including the ones still unfinished when 'no_new_job' is satisfied) are completed:

deeprm/environment.py

Lines 255 to 265 in b42eff0

if self.end == "no_new_job":  # end of new job sequence
    if self.seq_idx >= self.pa.simu_len:
        done = True

elif self.end == "all_done":  # everything has to be finished
    if self.seq_idx >= self.pa.simu_len and \
       len(self.machine.running_job) == 0 and \
       all(s is None for s in self.job_slot.slot) and \
       all(s is None for s in self.job_backlog.backlog):
        done = True

elif self.curr_time > self.pa.episode_max_length:  # run too long, force termination
    done = True

@ahmadreza9
Author

ahmadreza9 commented Jun 3, 2020

Thanks for your explanations. I wonder why you implemented the optimizers yourself. Did you have a special goal? (They are already predefined and implemented.) Does the Theano library provide these common optimizers or not?

@ahmadreza9
Author

ahmadreza9 commented Jun 5, 2020

  1. You defined entropy and report its mean. Why didn't you use this metric in the evaluation part of your article?
  2. I see that you use the variable mem_alloc = 4 in pg_su.py. Why did you assign this constant?

@ahmadreza9 ahmadreza9 changed the title loss function (In Policy Gradient section) loss function (In Policy Gradient section), optimizer and entropy Jun 5, 2020
@hongzimao
Owner

hongzimao commented Jun 6, 2020

We didn't reimplement the optimizers; Theano (and the more commonly used TensorFlow and PyTorch) all have built-in optimizers like RMSProp and Adam.

The entropy term promotes exploration at the beginning of RL training.
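
As an illustration, an entropy bonus is usually folded into the policy-gradient loss roughly like this (a sketch only; entropy_weight is a hypothetical coefficient, and prob_act / loss are the symbolic quantities sketched earlier in the thread):

import theano.tensor as T

# entropy of the action distribution at each step, averaged over the batch;
# the small constant avoids log(0) for near-deterministic policies
entropy = -T.sum(prob_act * T.log(prob_act + 1e-8), axis=1).mean()

# subtracting a weighted entropy term rewards more uniform (exploratory) policies
# early in training; the weight is typically annealed toward zero
entropy_weight = 0.01  # hypothetical value
total_loss = loss - entropy_weight * entropy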

pg_su is for supervised learning. If I remember correctly, mem_alloc is just a parameter controlling the size of the generated dataset.

@ahmadreza9
Author

ahmadreza9 commented Jun 6, 2020

Thank you for your attention, but I should clarify that I was talking about rmsprop_updates. I would also appreciate it if you renamed stepsize (the third input of the function below) to lr_rate. (It confused me :))
def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):
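
For context, a hand-written RMSProp update with that signature typically looks something like the following (a sketch of the standard RMSProp rule, not necessarily the repository's exact code):

import numpy as np
import theano
import theano.tensor as T

def rmsprop_updates(grads, params, stepsize, rho=0.9, epsilon=1e-9):
    """Return (shared_variable, new_value) pairs implementing one RMSProp step."""
    updates = []
    for param, grad in zip(params, grads):
        # running average of squared gradients, one accumulator per parameter
        acc = theano.shared(np.zeros(param.get_value().shape,
                                     dtype=theano.config.floatX))
        acc_new = rho * acc + (1.0 - rho) * grad ** 2
        updates.append((acc, acc_new))
        # scale the learning rate by the RMS of recent gradients
        updates.append((param, param - stepsize * grad / T.sqrt(acc_new + epsilon)))
    return updates

Renaming the stepsize argument to lr_rate, as suggested, would only change the parameter name, not the update rule.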

@ahmadreza9 ahmadreza9 reopened this Jun 6, 2020
@ahmadreza9
Author

ahmadreza9 commented Jun 6, 2020

The special parameter in your rmsprop_updates function is grads, which holds the gradients of the loss with respect to params: grads = T.grad(loss, params). I found torch.autograd.grad(gg, xx) for PyTorch and tf.gradients(ys, xs) for TensorFlow. Are these equivalent to your grads? (If you have worked with PyTorch and TensorFlow.)

@hongzimao
Owner

You are right that rmsprop_updates is a customized function. I guess back then, standardized library implementations of those optimizers were not available :) Things are easier nowadays. And you are right about the gradient operations in TensorFlow and PyTorch.
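
For instance, the PyTorch counterpart of grads = T.grad(loss, params) can be sketched as follows (a toy example; the tensor w stands in for the policy network's parameters):

import torch

# toy parameter and loss, standing in for the policy network's params and loss
w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

# equivalent of Theano's grads = T.grad(loss, params):
grads = torch.autograd.grad(loss, [w])  # one gradient tensor per parameter
print(grads)

In TensorFlow 1.x graph mode, tf.gradients(loss, params) plays the same role.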

@ahmadreza9
Author

ahmadreza9 commented Dec 5, 2020

The entropy term promotes exploration at the beginning of RL training.

Sir, your answer about entropy does not convince me. You have a method for it, but you did not use it in your network or in the REINFORCE training (single or multiple).

@ahmadreza9 ahmadreza9 reopened this Dec 5, 2020