fix(aws): CleanupAlarmsAgent cycle to catch exceptions #6333

Merged (2 commits, Jan 22, 2025)

@@ -86,36 +86,38 @@ class CleanupAlarmsAgent implements RunnableAgent, CustomScheduledAgent {
     getAccounts().each { NetflixAmazonCredentials credentials ->
       credentials.regions.each { AmazonCredentials.AWSRegion region ->
         log.info("Looking for alarms to delete")

-        def cloudWatch = amazonClientProvider.getCloudWatch(credentials, region.name)
-        Set<String> attachedAlarms = getAttachedAlarms(amazonClientProvider.getAutoScaling(credentials, region.name))
-        def describeAlarmsRequest = new DescribeAlarmsRequest().withStateValue(StateValue.INSUFFICIENT_DATA)
-
-        while (true) {
-          def result = cloudWatch.describeAlarms(describeAlarmsRequest)
-
-          List<MetricAlarm> alarmsToDelete = result.metricAlarms.findAll {
-            it.stateUpdatedTimestamp.before(DateTime.now().minusDays(daysToLeave).toDate()) &&
-              !attachedAlarms.contains(it.alarmName) &&
-              ALARM_NAME_PATTERN.matcher(it.alarmName).matches()
-          }
-
-          if (alarmsToDelete) {
-            // terminate up to 20 alarms at a time (avoids any AWS limits on # of concurrent deletes)
-            alarmsToDelete.collate(20).each {
-              log.info("Deleting ${it.size()} alarms in ${credentials.name}/${region.name} " +
-                "(alarms: ${it.alarmName.join(", ")})")
-              cloudWatch.deleteAlarms(new DeleteAlarmsRequest().withAlarmNames(it.alarmName))
-              Thread.sleep(500)
-            }
-          }
-
-          if (result.nextToken) {
-            describeAlarmsRequest.withNextToken(result.nextToken)
-          } else {
-            break
-          }
-        }
+        try {
+          def cloudWatch = amazonClientProvider.getCloudWatch(credentials, region.name)
+          Set<String> attachedAlarms = getAttachedAlarms(amazonClientProvider.getAutoScaling(credentials, region.name))
+          def describeAlarmsRequest = new DescribeAlarmsRequest().withStateValue(StateValue.INSUFFICIENT_DATA)
+
+          while (true) {
+            def result = cloudWatch.describeAlarms(describeAlarmsRequest)
+
+            List<MetricAlarm> alarmsToDelete = result.metricAlarms.findAll {
+              it.stateUpdatedTimestamp.before(DateTime.now().minusDays(daysToLeave).toDate()) &&
+                !attachedAlarms.contains(it.alarmName) &&
+                ALARM_NAME_PATTERN.matcher(it.alarmName).matches()
+            }
+
+            if (alarmsToDelete) {
+              // terminate up to 20 alarms at a time (avoids any AWS limits on # of concurrent deletes)
+              alarmsToDelete.collate(20).each {
+                log.info("Deleting ${it.size()} alarms in ${credentials.name}/${region.name} " +
+                  "(alarms: ${it.alarmName.join(", ")})")
+                cloudWatch.deleteAlarms(new DeleteAlarmsRequest().withAlarmNames(it.alarmName))
+                Thread.sleep(500)
+              }
+            }
+
+            if (result.nextToken) {
+              describeAlarmsRequest.withNextToken(result.nextToken)
+            } else {
+              break
+            }
+          }
+        } catch (Exception e) {
Contributor:
Is there a catch block to add "higher up" (where the run method is called) that would (also) help? Like, what if there are exceptions in other agents?

Member Author:
Higher up there is already a try/catch block in the RunnableAgent initial implementation, but I have pushed a try/catch block in CleanupDetachedInstancesAgent as well, which also implements RunnableAgent.

Contributor:
Seems like run is called from RunnableAgentExecution.executeAgent and all the places that call executeAgent are similar to DefaultAgentScheduler. There's a try/catch, but it only updates metrics. So what we get here is some extra logging, but I don't see how it's going to help with agent scheduling.

Member Author:
These cleanup agents go through each account and try to clean up any stale CloudWatch alarms, detached instances, etc. When a bad account is present in the credentials repository, the agent stops and doesn't go through the rest of the accounts.
By adding a try/catch here, we log that there was a problem with account X and continue to the next account.

Contributor:
Aaah yes, the try/catch here and in CleanupDetachedInstancesAgent is inside getAccounts().each. Seems like the consequences of this fix are good: instead of dying on the first error, we'll continue to clean up the other accounts.

log.error("Error occurred while processing alarms for ${credentials.name}/${region.name}: ${e.message}", e)
}
}
}
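As the review thread above notes, the point of the change is that an exception raised for one account/region is now logged and skipped instead of aborting the whole cleanup cycle. Below is a minimal, self-contained Groovy sketch of that pattern; the account names, the simulated failure, and the println/System.err calls are hypothetical stand-ins for the real credentials repository and logger, not code from the PR.

// Hypothetical account names; in clouddriver these come from the credentials repository.
def accounts = ['prod', 'bad-account', 'test']

accounts.each { String account ->
  try {
    // Simulate the per-account work (describe/delete stale alarms, detached instances, ...).
    if (account == 'bad-account') {
      throw new IllegalStateException("credentials for ${account} are invalid")
    }
    println "cleaned up ${account}"
  } catch (Exception e) {
    // Without this catch the exception propagates out of each{} and the remaining
    // accounts ('test' here) are skipped for the rest of the cycle; with it we log
    // the failure and move on, mirroring the try/catch added inside getAccounts().each.
    System.err.println "error processing ${account}: ${e.message}"
  }
}

Run as a plain Groovy script, this prints cleanup messages for prod and test while only reporting an error for bad-account, which is the behavior the PR aims for across accounts and regions.
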
@@ -74,7 +74,7 @@ class CleanupDetachedInstancesAgent implements RunnableAgent, CustomScheduledAgent {
     getAccounts().each { NetflixAmazonCredentials credentials ->
       credentials.regions.each { AmazonCredentials.AWSRegion region ->
         log.info("Looking for instances pending termination in ${credentials.name}:${region.name}")
-
+        try {
           def amazonEC2 = amazonClientProvider.getAmazonEC2(credentials, region.name, true)
           def describeInstancesRequest = new DescribeInstancesRequest().withFilters(
             new Filter("tag-key", [DetachInstancesAtomicOperation.TAG_PENDING_TERMINATION])
@@ -103,6 +103,9 @@ class CleanupDetachedInstancesAgent implements RunnableAgent, CustomScheduledAgent {
             break
           }
         }
+        } catch (Exception e) {
+          log.error("Error occurred while processing instances pending termination for ${credentials.name}/${region.name}: ${e.message}", e)
+        }
       }
     }
   }