Slowdown for cluster load 10% to 190% #12

Open
AluriJaganMohini opened this issue Dec 29, 2020 · 2 comments
AluriJaganMohini commented Dec 29, 2020

Dear Hongzi,

I am trying to reproduce all the results reported in the paper. From the source code, it is unclear how to plot the slowdown for cluster load from 10% to 190%. When I run run_script.py, I can see the generated logs, but nothing in them corresponds to Figure 4.

Could you please explain in detail how you plot the slowdown for cluster load from 10% to 190%? From the source code, it appears you vary the job rate from 0.1 to 1.0 to sweep the load from 10% to 190%, but when I relied only on the job rate in that range, the slowdown stayed constant from 100% up to 190%.

  1. It would be great if you could say a few words about how the load is varied from 10% to 190%.
  2. Can you please tell me how to reproduce Figure 4, or how to proceed from the generated logs to obtain it?

Thank You.

@hongzimao (Owner) commented:
Thanks for the detailed question, and sorry for the late reply. Load larger than 100% has to be ephemeral. Two things affect the system load: the interval between new jobs, as you pointed out, and the new job size distribution.

```python
def normal_dist(self):
    # -- new work duration --
    nw_len = np.random.randint(1, self.job_len + 1)  # same length in every dimension

    # -- new work resource request --
    nw_size = np.zeros(self.num_res)
    for i in range(self.num_res):
        nw_size[i] = np.random.randint(1, self.max_nw_size + 1)

    return nw_len, nw_size

def bi_model_dist(self):
    # -- job length --
    if np.random.rand() < self.job_small_chance:  # small job
        nw_len = np.random.randint(self.job_len_small_lower,
                                   self.job_len_small_upper + 1)
    else:  # big job
        nw_len = np.random.randint(self.job_len_big_lower,
                                   self.job_len_big_upper + 1)

    # -- job resource request --
    nw_size = np.zeros(self.num_res)
    dominant_res = np.random.randint(0, self.num_res)
    for i in range(self.num_res):
        if i == dominant_res:
            nw_size[i] = np.random.randint(self.dominant_res_lower,
                                           self.dominant_res_upper + 1)
        else:
            nw_size[i] = np.random.randint(self.other_res_lower,
                                           self.other_res_upper + 1)

    return nw_len, nw_size
```

You can compute the load as the average area of new jobs arriving per time interval, divided by the width (capacity) of the bottlenecked resource. If I remember correctly, since our simulated interval is finite, the scenario is ephemeral. You can vary the two distributions to create different loads. Hope this helps!
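To make that calculation concrete, here is a minimal sketch of the load computation under the bi-modal distribution. All the numeric values below are illustrative assumptions (not the repository's exact defaults), and the variable names mirror the code above but are chosen for this example:

```python
# Hypothetical parameters mirroring the bi-modal distribution above;
# values are illustrative assumptions, not the repo's exact defaults.
job_small_chance = 0.8
job_len_small = (1, 3)      # inclusive (lower, upper) bounds
job_len_big = (10, 15)
dominant_res = (5, 10)      # request range on the dominant resource
res_slot = 10               # width (capacity) of each resource
new_job_rate = 0.3          # expected new jobs per timestep

def expected_uniform(lo, hi):
    # mean of np.random.randint(lo, hi + 1), a discrete uniform draw
    return (lo + hi) / 2.0

# Expected job length under the small/big mixture
e_len = (job_small_chance * expected_uniform(*job_len_small)
         + (1 - job_small_chance) * expected_uniform(*job_len_big))

# Expected request on the bottlenecked (dominant) resource
e_dominant = expected_uniform(*dominant_res)

# Load = expected work area arriving per timestep
#        / capacity of the bottlenecked resource
load = new_job_rate * e_len * e_dominant / res_slot
print(load)  # → 0.9225, i.e. roughly 92% load
```

Sweeping `new_job_rate` (or the size/length ranges) then moves the load above or below 100%, which is how different cluster loads can be generated.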

@AluriJaganMohini (Author) commented:

Thank you for your response. I was able to figure it out. One more thing I wanted to clarify is the res_slot value. In this repository you use res_slot = 10, but in another repository you changed it to 20. Using 20 gives a very good slowdown, below 2, for the heuristics (SJF, Packer) as well, whereas the results reported in the paper show a much larger slowdown than those obtained with res_slot = 20. So, can we say that different parameter values yield different slowdown values? And does res_slot = 20 give different slowdown values while preserving the qualitative behaviour reported in the paper?
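For reference, the slowdown numbers being compared here can be sketched as below. The field names (`enter`, `finish`, `len`) are illustrative assumptions about what the simulator logs per job, not the repository's exact API:

```python
# Per-job slowdown: total time in system divided by the job's ideal
# duration; it is >= 1 by definition.
def job_slowdown(enter_time, finish_time, job_len):
    return (finish_time - enter_time) / float(job_len)

# Hypothetical (enter, finish, len) tuples for three completed jobs
jobs = [(0, 5, 5), (2, 12, 4), (3, 30, 9)]

slowdowns = [job_slowdown(*j) for j in jobs]   # [1.0, 2.5, 3.0]
avg_slowdown = sum(slowdowns) / len(slowdowns)
print(avg_slowdown)  # → 2.1666...
```

Since waiting time grows with contention, a larger res_slot (more capacity at the same job sizes) would naturally pull this average down, which is consistent with the lower heuristic slowdowns observed at res_slot = 20.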

Waiting for your response.
