[ci.jenkins.io] Move ephemeral VM agents to AWS #4316

Closed
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 31 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

  • Add back EC2 AMI builds in packer image
    • Will need to create EC2 resources to allow infra.ci to build VMs and create AMIs from it (check for old packer.tf in jenkins-infra/aws history)
      • We will need a dedicated VPC, IAM user with API key and limited permissions in this VPC to allow packer using EC2
    • Start from the PR which removed it
    • Don't forget the Garbage collector!
  • Add back the EC2 Jenkins plugin
    • Also need IAM user for ci.jenkins.io controller (distinct from packer - also check jenkins-infra/aws)
    • Need to check SSH vs. inbound
    • Need to check what replaces the current Azure VM init script in our older system (CloudInit?)
    • Need to check the VM retention to be ephemeral (or 1 min idle time max.)

github-actions bot commented Sep 28, 2024

Take a look at these similar issues to see if there isn't already a response to your problem:

  1. 74% [ci.jenkins.io] Move ephemeral Linux containers to AWS #4317

@dduportal dduportal changed the title Move ephemeral VM agents to AWS [ci.jenkins.io] Move ephemeral VM agents to AWS Sep 28, 2024
@dduportal dduportal added triage Incoming issues that need review ci.jenkins.io EC2 aws labels Sep 28, 2024
@smerle33 smerle33 added this to the infra-team-sync-2024-10-08 milestone Oct 2, 2024
@smerle33 smerle33 removed the triage Incoming issues that need review label Oct 2, 2024
@smerle33
Contributor

smerle33 commented Oct 3, 2024

To prepare this, we (Jay and I) need to create a dedicated user for packer-images, as we did for Azure in https://github.com/jenkins-infra/azure/blob/main/packer-resources.tf. I started creating it in the aws-sponsored repository:
jenkins-infra/terraform-aws-sponsorship#4
The plan for this PR succeeded and the PR got merged, but the deploy on main then failed because of missing rights: https://infra.ci.jenkins.io/job/terraform-jobs/job/terraform-aws-sponsorship/job/main/9/

We improved the policies of the infra-developer role, directly in the terraform-states repo, so that it can create the new user. After several rounds of trial and error we ended up with the correct set of rights (private link: https://github.com/jenkins-infra/terraform-states/blob/2ba74f30dd02a497062ecd8d1e5b52a7554e66b2/aws-sponsored/role-infra-developers.tf#L193-L210)

But when replaying the build on infra.ci, we still got this error:

AccessDenied: User: arn:aws:iam::<redacted>:user/terraform-awssponsored-production is not authorized to perform: iam:GetUser

whereas the deploy works locally with the infra-developer role (terraform-developer):

aws_iam_user.terraform_packer_user: Refreshing state... [id=terraform-packer-user]
data.aws_iam_policy_document.packer: Reading...
data.aws_iam_policy_document.packer: Read complete after 0s [id=<redacted>]
aws_iam_policy.packer: Refreshing state... [id=arn:aws:iam::<redacted>:policy/packer.iam_policy]
aws_iam_access_key.terraform_packer_api_keys: Refreshing state... [id=<redacted>]
aws_iam_user_policy_attachment.allow_packer_user: Refreshing state... [id=terraform-packer-user-<redacted>]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_iam_user_policy_attachment.allow_packer_user is tainted, so must be replaced
-/+ resource "aws_iam_user_policy_attachment" "allow_packer_user" {
      ~ id         = "terraform-packer-user-<redacted>" -> (known after apply)
        # (2 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.
aws_iam_user_policy_attachment.allow_packer_user: Destroying... [id=terraform-packer-user-<redacted>]
aws_iam_user_policy_attachment.allow_packer_user: Destruction complete after 0s
aws_iam_user_policy_attachment.allow_packer_user: Creating...
aws_iam_user_policy_attachment.allow_packer_user: Creation complete after 1s [id=terraform-packer-user-<redacted>]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

When checking in the UI, we can see that terraform-awssponsored-production can assume the role role/infra-developer, so the changes should work for it...

@smerle33
Contributor

smerle33 commented Oct 4, 2024

The packer user creation was moved to terraform-states, so there are no more IAM rights problems.

@jayfranco999
Collaborator

Update:

The AWS credentials used by the 'packer' user to access packer-images are now available in sops. The PR below adds the credentials in infra-ci to build packer image templates.

jenkins-infra/kubernetes-management#5780

While testing the pipeline used to create packer-images templates, @smerle33 and I encountered an error with the GC (garbage collector) scripts: https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/11/pipeline-console/?selected-node=25

To overcome this we granted executable permissions to the cleanup scripts – jenkins-infra/packer-images#1430

On further testing of the packer-images EC2 instances, the GC script ./cleanup/aws.sh passed but the next script, ./cleanup/aws_images.sh, failed with exit code 1: https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/12/pipeline-console/log?nodeId=58

The next steps are fixing the GC scripts and having at least one docker.ubuntu_22.04 amazon-ebs template created by the packer user.

@smerle33
Contributor

smerle33 commented Oct 7, 2024

We are trying to set up our environment to use this new packer user when running packer locally.
We first tried as infra-developer, but we ran into hashicorp/packer#12110, which seems to prevent assuming a role through a profile.
We are updating the code to match the Azure way of working, so that we can pass the token to packer through environment variables and run the builds with the new user's token.

@jayfranco999
Collaborator

jayfranco999 commented Oct 8, 2024

Update:

We created a user terraform-packer-user and exported the credentials to infra.ci. With this we were able to provide the necessary user policies required to create packer-images EC2 Ubuntu-22.04 arm64 and amd64 VM agents.
https://infra.ci.jenkins.io/job/infra-tools/job/packer-images/job/PR-1430/21/

Next steps involve

  • finding out how to use multiple aws_spot_instance_types for amd64 and arm64 VM agents.
  • Fixing GC scripts for aws instances

@jayfranco999
Collaborator

jayfranco999 commented Oct 10, 2024

Update:

The GC script now works in our pipeline: we added handling so that the AMI list can be an empty array in case no AMI IDs are found. The dry run worked as expected.

12:10:27  == DRY RUN:
12:10:27  aws ec2 deregister-image --dry-run --image-id ami-025bfc41ca974eb4f
12:10:27  
12:10:27  == DRY RUN:
12:10:27  aws ec2 deregister-image --dry-run --image-id ami-05585bfe905b5e6fb
12:10:27  
12:10:28  == DRY RUN:
12:10:28  aws ec2 deregister-image --dry-run --image-id ami-01031637bfcffa070
12:10:28  
12:10:28  + echo '== AWS Packer Cleanup IMAGES finished.'
12:10:28  == AWS Packer Cleanup IMAGES finished.
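
For illustration only, the empty-list handling and the dry-run output above can be sketched in Python. This is not the actual ./cleanup/aws_images.sh script (which is a shell script); the function name build_deregister_commands is hypothetical, and only the aws ec2 deregister-image command line is taken from the log above.

```python
from typing import List


def build_deregister_commands(ami_ids: List[str], dry_run: bool = True) -> List[List[str]]:
    """Build one `aws ec2 deregister-image` command per AMI ID.

    An empty list of AMI IDs is valid and simply yields no commands,
    so the cleanup can finish successfully when nothing has to be removed.
    """
    commands = []
    for ami_id in ami_ids:
        cmd = ["aws", "ec2", "deregister-image"]
        if dry_run:
            # Ask the AWS CLI to check the request without deregistering the image.
            cmd.append("--dry-run")
        cmd += ["--image-id", ami_id]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # AMI IDs copied from the dry-run log above.
    for cmd in build_deregister_commands(["ami-025bfc41ca974eb4f", "ami-05585bfe905b5e6fb"]):
        print("== DRY RUN:")
        print(" ".join(cmd))
    # No AMI found: nothing is printed and no error is raised.
    assert build_deregister_commands([]) == []
```

The commands are only built and printed here, not executed, so the sketch can be run without AWS credentials.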

On further testing of our EC2 VMs, we discovered an issue that was preventing the packer-images build: apt on the agent VMs was incompatible with the outdated git_linux_version: 2.46.2 used by packer-images. This was fixed by the PR jenkins-infra/packer-images#1440

Packer-images now uses git_linux_version: 2.47.0

@smerle33
Contributor

  • Will need to create EC2 resources to allow infra.ci to build VMs and create AMIs from it (check for old packer.tf in jenkins-infra/aws history)

Nothing about that file in the history, but we found the removal PR, which helped us: jenkins-infra/packer-images#734

@jayfranco999
Collaborator

jayfranco999 commented Oct 14, 2024

Due to the complexity of the PR jenkins-infra/packer-images#1430, we are splitting the work into 4 stages:

  • packer-images: Add linux AMI
  • packer-images: Add Garbage Collector for AWS
  • packer-images: Add Windows AMI
  • packer-images: Optimize AMI builds (Restricted network, spot instances)

@dduportal
Contributor Author

Update: we now have a working Windows 2019 template (JDK17 for agent, JDK21 for default java) since jenkins-infra/jenkins-infra#3804 and jenkins-infra/jenkins-infra#3805, with the following elements:

  • VM takes 2 to 6 min to start up before SSH => might be improved by:
    • EC2 Fast Launch, which requires setup on the AMI, but it seems it does not require anything on the EC2 plugin (to be verified!)
    • Try using the EC2 WinRM connection method instead of Unix SSH (is it faster?)
    • Analyze why SSH takes so much time before accepting the user key (1 to 3 minutes between the first accepted SSH connection and the valid authentication)
  • Test with Windows 2022
  • Set up the "cloud init" and "init script" as code in puppet with:
    • Use the YAML syntax for cloud init (as the <powershell> XML syntax fails when specified to the EC2 plugins => worth an RFE) with reusability of the Azure VM Windows set of instructions
    • Use the same technique as with EC2 Linux by creating a token file at the end of cloud init, and check the presence of this file in the init script, to make sure everything works as expected
  • Set up the PATH configuration with the new syntax (e.g. appending: key: PATH and value: ${PATH};xxxx for Windows)
    • Update Linux init script with the same technique
    • But keep the cloud init java selection in both cases
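
As a rough sketch of the token file technique listed above (cloud init creates a file as its last step, and the init script checks for it), the check could look like the following Python. The real init scripts are not Python, and the function name wait_for_marker, the marker path, and the timeout values are all hypothetical.

```python
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll until the marker file exists or the timeout expires.

    Returns True if the file was found (cloud init finished),
    False if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical marker path, written by cloud init as its very last instruction.
    marker = Path("/var/tmp/cloud-init.done")
    if not wait_for_marker(marker, timeout_s=10.0, interval_s=1.0):
        raise SystemExit(f"cloud init did not complete: {marker} is missing")
```

Failing the init script when the marker is missing keeps a half-provisioned VM from being registered as a usable agent.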

@dduportal
Contributor Author

Update after running a few tests:

@dduportal
Contributor Author

Update:

15:10:35  Still waiting to schedule task
15:10:35  Waiting for next available executor
15:15:04  Running on [EC2 (aws-us-east-2) - Windows 2019 x86_64 with JDK21 (i-01413e69a5275f018)](https://3.146.166.108/computer/EC2%20%28aws%2Dus%2Deast%2D2%29%20%2D%20Windows%202019%20x86%5F64%20with%20JDK21%20%28i%2D01413e69a5275f018%29/) in C:/Jenkins/agent/workspace/test
[Pipeline] {
[Pipeline] bat
15:15:11  
15:15:14  administrator@EC2AMAZ-TJBM10D C:\Jenkins\agent\workspace\test>mvn -v 
15:15:27  Apache Maven 3.9.9 (8e8579a9e76f7d015ee5ec7bfcdc97d260186937)
15:15:27  Maven home: C:\tools\apache-maven-3.9.9
15:15:27  Java version: 21.0.5, vendor: Eclipse Adoptium, runtime: C:\tools\jdk-21
15:15:27  Default locale: en_US, platform encoding: UTF-8
15:15:27  OS name: "windows server 2019", version: "10.0", arch: "amd64", family: "windows"
[Pipeline] bat
15:15:28  
15:15:28  administrator@EC2AMAZ-TJBM10D C:\Jenkins\agent\workspace\test>java --version 
15:15:28  openjdk 21.0.5 2024-10-15 LTS
15:15:28  OpenJDK Runtime Environment Temurin-21.0.5+11 (build 21.0.5+11-LTS)
15:15:28  OpenJDK 64-Bit Server VM Temurin-21.0.5+11 (build 21.0.5+11-LTS, mixed mode, sharing)
  • Fast Launch build is almost there (unrelated build error).

Next step:

@dduportal
Contributor Author

Just enabled Fast Launch in jenkins-infra/jenkins-infra#3826. The result is impressive: less than 2 min to run a Windows agent:

12:52:18  Still waiting to schedule task
12:52:18  ‘[EC2 (aws-us-east-2) - Windows 2019 x86_64 with JDK21 (i-0a7507cb13842640a)](https://3.146.166.108/computer/EC2%20%28aws%2Dus%2Deast%2D2%29%20%2D%20Windows%202019%20x86%5F64%20with%20JDK21%20%28i%2D0a7507cb13842640a%29/)’ is offline
12:53:35  Running on [EC2 (aws-us-east-2) - Windows 2019 x86_64 with JDK21 (i-0a7507cb13842640a)](https://3.146.166.108/computer/EC2%20%28aws%2Dus%2Deast%2D2%29%20%2D%20Windows%202019%20x86%5F64%20with%20JDK21%20%28i%2D0a7507cb13842640a%29/) in C:/Jenkins/agent/workspace/test
[Pipeline] {
[Pipeline] bat
12:53:41  
12:53:44  administrator@EC2AMAZ-SB4UVM8 C:\Jenkins\agent\workspace\test>mvn -v

Of course it has a cost (we have a LOT of snapshots) but worth it!
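
To put numbers on the improvement, the waiting time can be computed from the log timestamps: from "Still waiting to schedule task" to "Running on" in the log above, and the same two messages in the earlier log without Fast Launch. This is a small sketch: the helper name elapsed_seconds is hypothetical, and it assumes both timestamps fall on the same day and that the first message is a fair proxy for the moment the VM was requested.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS log timestamps of the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Earlier run, without Fast Launch: 15:10:35 -> 15:15:04
print(elapsed_seconds("15:10:35", "15:15:04"))  # 269 seconds (4 min 29 s)

# Run with Fast Launch enabled: 12:52:18 -> 12:53:35
print(elapsed_seconds("12:52:18", "12:53:35"))  # 77 seconds (1 min 17 s)
```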

@dduportal
Contributor Author

dduportal commented Jan 28, 2025

Update: testing in progress with https://github.com/jenkinsci/docker and https://github.com/jenkinsci/acceptance-test-harness

A few elements:

  • The latest ec2 plugin version (1822.v87175d209b_b_5) has regressions preventing Windows VM agents from starting
    • Looks like the SSH launcher with Windows does not play well with this version in which the SSH java native implementation was changed
    • Gotta report to the plugin (already did raise awareness to the change author)
    • Rolled back the plugin to https://plugins.jenkins.io/ec2/releases/#version_1801.v526399543dca_ for now: we'll decide if we need to pin it or wait for a bugfix and new version
  • After the plugin downgrade, the job jenkinsci/docker (https://aws.ci.jenkins.io/job/docker/) was tested:
    • It is green despite failing: we have to fix the associated pipeline library (Docker credentials are not used because aws.ci.jenkins.io is not recognized when logging in to Docker). The error message is: Cannot use Docker credentials outside of jenkins infra environments
    • The VMs were spun up successfully, in less than 1m30 each 👍
    • We will most probably be blocked by [ci.jenkins.io] Set up an ECR pull through cache #4321 but at least the agent VM spin up works (great first step)
  • Then, ATH which failed:
    • First element: aws.ci.jenkins.io does not provide any template that matches the node {} step (i.e. without a label). We must fix the configuration to match what we have in ci.jenkins.io (e.g. the jnlp default pod template), otherwise the first stage of the ATH pipeline is stuck waiting for an agent

    • Then: we were missing the launchable credential for ATH (same as for the BOM). Created manually, along with the Jenkins Core one (will be useful for later tests).

    • Then: the agents are started but the build is failing when reaching, for each, the mvn command in the run.sh:

      + BROWSER=firefox
      + JENKINS_WAR=target/jenkins-war.war
      + mvn -V -e -ntp -Plts -Dmaven.repo.local=/home/jenkins/agent/workspace/ath_master@tmp/m2repo -Dmaven.test.failure.ignore=true -Dcsp.rule=false -DforkCount=1 -B
      The JAVA_HOME environment variable is not defined correctly,
      this environment variable is needed to run this program.
      

@dduportal
Contributor Author

Then: the agents are started but the build is failing when reaching, for each, the mvn command in the run.sh:

@dduportal
Contributor Author

Update:

=> test in progress on https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin and then on ATH

@dduportal
Contributor Author

The latest ec2 plugin version (1822.v87175d209b_b_5) has regressions preventing Windows VM agents from starting

This version has been excluded from UC: jenkins-infra/update-center2#837. It was reported as faulty by users in https://issues.jenkins.io/browse/JENKINS-75187.
Worked with the change author to confirm that his fixes (not published yet) are working as expected on aws.ci.jenkins.io!

@dduportal
Contributor Author

=> test in progress on https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin and then on ATH

Damn ACP is not reachable. Is the URL the correct one 🤔

WARNING: the artifact caching proxy server 'http://k8s-artifact-artifact-51a40b09ac-8673e1bdefab0d64.elb.us-east-2.amazonaws.com:8080/' isn't reachable, will use repo.jenkins-ci.org.

@dduportal
Contributor Author

=> test in progress on https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin and then on ATH

Damn ACP is not reachable. Is the URL the correct one 🤔

WARNING: the artifact caching proxy server 'http://k8s-artifact-artifact-51a40b09ac-8673e1bdefab0d64.elb.us-east-2.amazonaws.com:8080/' isn't reachable, will use repo.jenkins-ci.org.

Fixed by jenkins-infra/jenkins-infra#3850. First try on https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin/job/master/7/ looks good: ACP was used through the LB.

Next testing:

@dduportal
Contributor Author

https://issues.jenkins.io/browse/JENKINS-75187?focusedId=452172&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-452172 => everything looks good!

=> We can close this issue once we have opened one on the pipeline library about the docker login problem

@dduportal
Contributor Author

https://issues.jenkins.io/browse/JENKINS-75187?focusedId=452172&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-452172 => everything looks good!

=> We can close this issue once we have opened one on the pipeline library about the docker login problem

jenkins-infra/pipeline-library#904
