fix(aws): CleanupAlarmsAgent cycle to catch exceptions #6333

Merged
2 commits merged into spinnaker:master on Jan 22, 2025

Conversation

christosarvanitis (Member):

A bad account or any exception currently stops the CleanupAlarmsAgent.

This change catches the exception and logs it, since the purpose of this agent is to clean up stale CloudWatch alarms from the AWS/ECS accounts in Spinnaker, not to verify whether the accounts are valid.

Review thread on the new catch block in CleanupAlarmsAgent:

      }
    } catch (Exception e) {
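
A minimal sketch of the shape this catch block sits in, assuming the agent walks accounts via getAccounts().each as described later in the thread (Groovy, illustrative only; the cleanup and log calls are placeholders, not the actual CleanupAlarmsAgent diff):

    // Illustrative sketch only -- not the actual CleanupAlarmsAgent source.
    void run() {
      getAccounts().each { account ->
        try {
          // placeholder for the per-account alarm cleanup work
          cleanupAlarmsFor(account)
        } catch (Exception e) {
          // Log and keep going: one bad account should not stop cleanup of the rest.
          log.error("Error cleaning up alarms for account ${account.name}", e)
        }
      }
    }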
Contributor:

Is there a catch block to add "higher up" (where the run method is called) that would (also) help? For example, what if there are exceptions in other agents?

Member Author:

Higher up there is already a try/catch block in the initial RunnableAgent implementation, but I have also pushed a try/catch block into CleanupDetachedInstancesAgent, which implements RunnableAgent as well.

Contributor:

Seems like run is called from RunnableAgentExecution.executeAgent and all the places that call executeAgent are similar to DefaultAgentScheduler. There's a try/catch, but it only updates metrics. So what we get here is some extra logging, but I don't see how it's going to help with agent scheduling.

Member Author:

These cleanup agents go through each account and try to clean up any stale CloudWatch alarms, detached instances, etc. When a bad account is present in the credentials repository, the agent stops and doesn't go through the rest of the accounts. By adding a try/catch here, we log that there was a problem with account X and continue to the next account.
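
To illustrate the difference (purely a sketch, not the clouddriver scheduler or agent source; accounts, cleanUp, and log are placeholder names): an outer catch around the whole run stops at the first failing account, while the catch added in this PR lives inside the per-account loop, so the remaining accounts are still processed.

    // Outer catch (e.g. at the scheduler level around run()): the loop dies on
    // the first bad account and the remaining accounts are never visited.
    try {
      accounts.each { account -> cleanUp(account) }
    } catch (Exception e) {
      log.error("agent run failed", e)
    }

    // Inner catch (the approach in this PR): each failure is logged and the
    // loop continues, so the remaining accounts still get cleaned up.
    accounts.each { account ->
      try {
        cleanUp(account)
      } catch (Exception e) {
        log.error("Failed to clean up account ${account}", e)
      }
    }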

Contributor:

Aaah yes, the try/catch here and in CleanupDetachedInstancesAgent is inside getAccounts().each. Seems like the consequences of this fix are good: instead of dying on the first error, we'll continue to clean up other accounts.

@christosarvanitis force-pushed the fix-CleanupAlarmsAgentException branch from 7b366a7 to 9b0e768 on January 20, 2025 08:17
@christosarvanitis force-pushed the fix-CleanupAlarmsAgentException branch from 9b0e768 to 7d77512 on January 20, 2025 10:14
@dbyron-sf (Contributor):

@christosarvanitis if you bring this branch up to date, I think it's ready to go.

@christosarvanitis (Member Author):

> @christosarvanitis if you bring this branch up to date, I think it's ready to go.

Done! Thanks @dbyron-sf

@dbyron-sf added the "ready to merge" label (Approved and ready for a merge) on Jan 22, 2025
@mergify bot added the "auto merged" label (Merged automatically by a bot) on Jan 22, 2025
@mergify bot merged commit c4df136 into spinnaker:master on Jan 22, 2025
23 checks passed
christosarvanitis added a commit to armory-io/clouddriver that referenced this pull request Jan 22, 2025
Labels: auto merged (Merged automatically by a bot), ready to merge (Approved and ready for a merge), target-release/1.37

3 participants