fix: improve test startup logic #89

Merged: 4 commits, Jan 19, 2022

Conversation

victorlcm (Contributor)

Hello everyone,

We've been using this operator for a while and observed strange behavior in tests with a long init phase: they remained in a paused state because the k6 HTTP server was not ready yet. So we've added a simple verification that the HTTP server is up before sending the start requests to the jobs. It's a small fix, but it has been very effective for us.
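
For illustration, a minimal sketch of the kind of verification described above, assuming the k6 REST API listens on port 6565 and exposes a /v1/status endpoint, and assuming the check is made against the runner pod's IP. The function name isPodReady comes up later in this thread; the body below is an assumption, not the exact code of this PR.

package controllers

import (
    "fmt"
    "net/http"
    "time"

    corev1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the k6 HTTP server inside the runner pod is
// already answering, by querying its REST API status endpoint once; the
// caller is expected to requeue and retry until this returns true.
func isPodReady(pod corev1.Pod) bool {
    client := http.Client{Timeout: 2 * time.Second}

    resp, err := client.Get(fmt.Sprintf("http://%s:6565/v1/status", pod.Status.PodIP))
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    return resp.StatusCode == http.StatusOK
}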

yorugac self-requested a review (January 6, 2022)
yorugac (Collaborator) commented on Jan 6, 2022

Hi @victorlcm, thank you for catching this case and fixing it! 👏

Just as a heads-up, I'll be able to fully test the logic of this PR only next week. But at first glance, it looks fine to me 👍

Review comment on controllers/k6_start.go (outdated, resolved)
yorugac (Collaborator) left a review comment

@victorlcm I've been looking at this PR more closely today and I'd like to clarify the following: in your use case, do you run it with the separate flag set in the YAML configuration?

So far all seems good, though there is one more thing I'd ask you to adjust; please see the comment 🙂 I'll check a few more things in the meantime.

Review comment on controllers/k6_start.go (outdated, resolved)
victorlcm (Contributor, Author) commented

Hey @yorugac, we're running with separate: false, controlling how many resources each job uses by setting the runner limits and requests and capping the VUs according to those resources 😄

victorlcm requested a review from yorugac (January 11, 2022)
yorugac (Collaborator) commented on Jan 11, 2022

Hi @victorlcm, thanks for the fix, but I'm afraid that's not all yet 😂 I've found one more issue in the current approach, and we really should try to fix the logic there. Let me explain. The operator uses k6's REST API to start the runners, but it does so with hostnames, not IPs. The hostnames come from:

for _, service := range sl.Items {
    hostnames = append(hostnames, service.ObjectMeta.Name)
}

In contrast, your PR relies on the IP of the runner. This approach is incorrect when there is more than one runner on the same node. I assume you haven't encountered such an error in your usage, but it can happen otherwise. Basically, the isPodReady check should become isServiceReady, and it can be executed right in the above loop, i.e. if at least one service is not responding, log an error and return false.

Then the checks become separate:

  1. get all pods and see if they are in Running state
  2. get all services and see if they respond to REST
  3. now start

Does that make sense? Could you please try this updated logic with your use case, to see if it continues to work?
I'm sorry for the confusion; I only spotted this issue recently while testing with your branch 😅
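
To make this concrete, a rough sketch of what the per-service check could look like, executed inside the loop quoted above. The helper names, the status endpoint, the timeout and the return values are illustrative assumptions rather than code from this PR:

package controllers

import (
    "fmt"
    "net/http"
    "time"

    "github.com/go-logr/logr"
    corev1 "k8s.io/api/core/v1"
)

// collectHostnames walks the runner services and, while collecting their
// names for the start requests, verifies that each one already answers on
// the k6 REST API. If even one service is not responding yet, it logs the
// error and reports false so that the start can be retried later.
func collectHostnames(log logr.Logger, sl *corev1.ServiceList) ([]string, bool) {
    var hostnames []string
    for _, service := range sl.Items {
        hostnames = append(hostnames, service.ObjectMeta.Name)

        if !isServiceReady(service.ObjectMeta.Name) {
            log.Error(fmt.Errorf("service %s is not ready", service.ObjectMeta.Name), "aborting start")
            return nil, false
        }
    }
    return hostnames, true
}

// isServiceReady asks the k6 REST API behind the given service name whether
// it responds; any error or non-200 answer counts as "not ready".
func isServiceReady(hostname string) bool {
    client := http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(fmt.Sprintf("http://%s:6565/v1/status", hostname))
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}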

victorlcm (Contributor, Author) commented

Hey @yorugac, sorry for the delay, this has been a busy week for me. Your point makes sense! I'll make the changes later this week or early next week, is that OK?

Thanks for the great review!

yorugac (Collaborator) commented on Jan 13, 2022

@victorlcm NP, that'll be great! Looking forward to your update 🙂

victorlcm (Contributor, Author) commented

Hey @yorugac, I've changed it to use the service approach and it seems to be working fine now. Can you test it again? Thanks! 😄

knmsk (Contributor) commented on Jan 17, 2022

@yorugac the way this PR is now, if it gets approved, could further changes also remove the starter pod and move the logic into the operator, making the curl call to the runners from there?

yorugac (Collaborator) commented on Jan 18, 2022

Hi @knmsk, good question. I think what you described is an option to move away from curl completely (#87), but it would remove a sync point from the logic. I.e. right now the starter uses basically one command and acts as a synchronization point, and removing it could delay when the runners actually start. AFAIK, some setups might also depend on the starter having separate security requirements (e.g. see #74). These points might be negligible in most cases, or they might depend on the exact implementation, but I believe the implications should be thought through before going that route.
Do you have a specific case in mind for that? Perhaps it makes sense to create a separate issue to open a discussion on the topic 🙂

knmsk (Contributor) commented on Jan 18, 2022

This makes a lot of sense, I will create a separate issue to discuss this topic :)
Sorry for the noise in this PR.

yorugac (Collaborator) left a review comment

Hi @victorlcm, thank you for the update! I tested it yesterday and I think it's good to go now 👍 😄

There is one thing about domains that I wonder about: the starter right now uses the short form http://${service.ObjectMeta.Name}:6565 🤔 But in this PR's check the namespace is likely needed indeed... ATM I'm not aware of anywhere this can fail, but we should probably settle on a consistent DNS scheme sometime.
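
As a small illustration of the two DNS forms in question (the service name and namespace below are hypothetical, and the default cluster.local cluster domain is assumed):

package main

import "fmt"

func main() {
    service := "k6-sample-1" // hypothetical runner service name
    namespace := "default"   // hypothetical namespace

    // Short form, as the starter uses today; it resolves only from within
    // the same namespace.
    fmt.Printf("http://%s:6565\n", service)

    // Fully qualified in-cluster form, which also resolves across namespaces.
    fmt.Printf("http://%s.%s.svc.cluster.local:6565\n", service, namespace)
}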

Thanks again for all your work!
