fix: improve test startup logic #89

Merged: 4 commits, Jan 19, 2022

Conversation

victorlcm (Contributor)

Hello everyone,

We've been using this operator for a while and observed strange behavior in tests with a long init phase: they remained in a paused state because the k6 HTTP server was not ready yet. So we've added a simple verification that the HTTP server is up before sending the start requests to the jobs. It's a small fix, but it has been very effective for us.
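
For illustration, a minimal sketch of the kind of verification described above, assuming the k6 REST API listens on port 6565 and exposes a /v1/status endpoint, and assuming the check is made against the runner pod's IP. The function name isPodReady comes up later in this thread; the body below is an assumption, not the exact code of this PR.

package controllers

import (
    "fmt"
    "net/http"
    "time"

    corev1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the k6 HTTP server inside the runner pod is
// already answering, by querying its REST API status endpoint once; the
// caller is expected to requeue and retry until this returns true.
func isPodReady(pod corev1.Pod) bool {
    client := http.Client{Timeout: 2 * time.Second}

    resp, err := client.Get(fmt.Sprintf("http://%s:6565/v1/status", pod.Status.PodIP))
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    return resp.StatusCode == http.StatusOK
}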

yorugac self-requested a review (January 6, 2022)
yorugac (Collaborator) commented on Jan 6, 2022

Hi @victorlcm, thank you for catching this case and fixing it! 👏

Just as a heads-up, I'll be able to fully test the logic of this PR only next week. But at first glance, it looks fine to me 👍

Review comment on controllers/k6_start.go (outdated, resolved)
yorugac (Collaborator) left a review comment

@victorlcm I've been looking at this PR more closely today and I'd like to clarify the following: in your use case, do you run it with the separate flag set in the YAML configuration?

So far all seems good, though there is one more thing I'd ask you to adjust; please see the comment 🙂 I'll check a few more things in the meantime.

Review comment on controllers/k6_start.go (outdated, resolved)
victorlcm (Contributor, Author) commented

Hey @yorugac, we're running with separate: false, controlling how many resources each job uses by setting the runner limits and requests and capping the VUs according to those resources 😄

victorlcm requested a review from yorugac (January 11, 2022)
yorugac (Collaborator) commented on Jan 11, 2022

Hi @victorlcm, thanks for the fix, but I'm afraid that's not all yet 😂 I've found one more issue in the current approach, and we really should try to fix the logic there. Let me explain. The operator uses k6's REST API to start the runners, but it does so with hostnames, not IPs. The hostnames come from:

for _, service := range sl.Items {
    hostnames = append(hostnames, service.ObjectMeta.Name)
}

In contrast, your PR relies on the IP of the runner. This approach is incorrect when there is more than one runner on the same node. I assume you haven't encountered such an error in your usage, but it can happen otherwise. Basically, the isPodReady check should become isServiceReady, and it can be executed right in the above loop, i.e. if at least one service is not responding, log an error and return false.

Then the checks become separate:

  1. get all pods and see if they are in Running state
  2. get all services and see if they respond to REST
  3. now start

Does that make sense? Could you please try this updated logic with your use case, to see if it continues to work?
I'm sorry for the confusion; I only spotted this issue recently while testing with your branch 😅
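
To make this concrete, a rough sketch of what the per-service check could look like, executed inside the loop quoted above. The helper names, the status endpoint, the timeout and the return values are illustrative assumptions rather than code from this PR:

package controllers

import (
    "fmt"
    "net/http"
    "time"

    "github.com/go-logr/logr"
    corev1 "k8s.io/api/core/v1"
)

// collectHostnames walks the runner services and, while collecting their
// names for the start requests, verifies that each one already answers on
// the k6 REST API. If even one service is not responding yet, it logs the
// error and reports false so that the start can be retried later.
func collectHostnames(log logr.Logger, sl *corev1.ServiceList) ([]string, bool) {
    var hostnames []string
    for _, service := range sl.Items {
        hostnames = append(hostnames, service.ObjectMeta.Name)

        if !isServiceReady(service.ObjectMeta.Name) {
            log.Error(fmt.Errorf("service %s is not ready", service.ObjectMeta.Name), "aborting start")
            return nil, false
        }
    }
    return hostnames, true
}

// isServiceReady asks the k6 REST API behind the given service name whether
// it responds; any error or non-200 answer counts as "not ready".
func isServiceReady(hostname string) bool {
    client := http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(fmt.Sprintf("http://%s:6565/v1/status", hostname))
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}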

victorlcm (Contributor, Author) commented

Hey @yorugac, sorry for the delay, this has been a busy week for me. Your point makes sense! I'll make the changes later this week or early next week, is that OK?

Thanks for the great review!

yorugac (Collaborator) commented on Jan 13, 2022

@victorlcm NP, that'll be great! Looking forward to your update 🙂

victorlcm (Contributor, Author) commented

Hey @yorugac, I've changed it to use the service approach and it seems to be working fine now. Can you test it again? Thanks! 😄

knmsk (Contributor) commented on Jan 17, 2022

@yorugac the way this PR is now, if it gets approved, could further changes also remove the starter pod and move the logic into the operator, making the curl call to the runners from there?

yorugac (Collaborator) commented on Jan 18, 2022

Hi @knmsk, good question. I think what you described is an option to move away from curl completely (#87), but it would remove a sync point from the logic. I.e. right now the starter uses basically one command and acts as a synchronization point, and removing it could delay when the runners actually start. AFAIK, some setups might also depend on the starter having separate security requirements (e.g. see #74). These points might be negligible in most cases, or they might depend on the exact implementation, but I believe the implications should be thought through before going that route.
Do you have a specific case in mind for that? Perhaps it makes sense to create a separate issue to open a discussion on the topic 🙂

knmsk (Contributor) commented on Jan 18, 2022

This makes a lot of sense, I will create a separate issue to discuss this topic :)
Sorry for the noise in this PR.

yorugac (Collaborator) left a review comment

Hi @victorlcm, thank you for the update! I tested it yesterday and I think it's good to go now 👍 😄

There is one thing about domains that I wonder about: the starter right now uses the short form http://${service.ObjectMeta.Name}:6565 🤔 But in this PR's check the namespace is likely needed indeed... ATM I'm not aware of anywhere this can fail, but we should probably settle on a consistent DNS scheme sometime.
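
As a small illustration of the two DNS forms in question (the service name and namespace below are hypothetical, and the default cluster.local cluster domain is assumed):

package main

import "fmt"

func main() {
    service := "k6-sample-1" // hypothetical runner service name
    namespace := "default"   // hypothetical namespace

    // Short form, as the starter uses today; it resolves only from within
    // the same namespace.
    fmt.Printf("http://%s:6565\n", service)

    // Fully qualified in-cluster form, which also resolves across namespaces.
    fmt.Printf("http://%s.%s.svc.cluster.local:6565\n", service, namespace)
}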

Thanks again for all your work!
