[ci.jenkins.io] Set up an ECR pull through cache #4321

Open
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 9 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

Last summer, we had to set up a Docker "pull-through" caching registry in Azure to keep builds from breaking on HTTP 429 rate-limit responses from DockerHub: #4192 (comment)

(edit) We changed the plan from "using ECR" (see below) to "using a self-hosted Docker registry in mirror mode", mainly because ECR does not allow transparent proxying and we don't want to have to change every Docker image name in all ATH/plugin tests and Jenkins Docker images.


Plan with a self hosted Docker registry, following the Official Docker Mirror documentation:

  • Note: "Distribution" (the Docker registry official application) is under CNCF management (ref. https://distribution.github.io/distribution/about/) which is a good indicator of a sustainable application nowadays
    • The additional effort on this will make the system easier to move between clouds (back to Azure, etc.)
  • Hosted in the EKS cluster: we have room for it in the "application" node pool, next to ACP
  • The registry will be exposed over HTTP (no TLS) and without authentication, but will only be reachable from internal networks
    • Only for the first step: requires using "insecure-registry" (https://docs.docker.com/reference/cli/dockerd/#insecure-registries)
    • No authentication because Docker Registry mirrors do not support authentication on the backends. It's the whole reason why we dropped ECR and why ACR is only reachable through Private Endpoints.
    • Exposed through a Kubernetes Service of type "LoadBalancer" using annotations to specify a nlb-ip internal AWS LB (same as ACP).
  • To avoid hitting the DockerHub rate limit, we need a dedicated DockerHub token (dedicated so that we don't rate limit other services), scoped to "read" only (we only need to pull images) on "public repositories" (if leaked or misused, this token only gives access to public data: no risk of reading or writing anything sensitive)
  • Current storage used on ACR is close to 259GB. Assuming we'll have garbage collection in place, let's start with a data volume of 250GB. It will be automatically tracked by Datadog, so we'll be alerted if we need more space.
  • We can use this on the Jenkins and Jenkins BOM node pools for faster autoscaling

  • Let's get started with the Helm chart https://github.com/twuni/docker-registry.helm: I audited it and it looks fine as a starting point

    • NOT an official chart (there aren't any)
    • We should NOT try using S3, because the filesystem driver is recommended in https://distribution.github.io/distribution/recipes/mirror/#what-about-my-disk for this use case (i.e. proxy mirror).
    • No need to get fancy with HA: our Docker engines will fall back to DockerHub by default if this registry is down during operations, including retries (built into the Docker CLI).
    • We want to enable the garbage collector. As described in https://distribution.github.io/distribution/about/garbage-collection/, it might not be enough as it only removes blobs without manifests: we might have to clean up images at some point (or... clean up the data disk ;))
    • We must have a short name for the helmfile release, otherwise GC and SVC cannot be created (name too long)
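
For illustration, the Docker Engine side of the plan above could look like the sketch below. It builds a `daemon.json` fragment using the `registry-mirrors` and `insecure-registries` daemon options. The hostname `registry-mirror.internal.example` and port `5000` are assumptions made up for the example, not values from this issue.

```python
import json


def build_daemon_config(mirror_host: str, port: int = 5000) -> dict:
    """Return a dockerd daemon.json fragment pointing the engine at a plain-HTTP mirror."""
    endpoint = f"{mirror_host}:{port}"
    return {
        # Mirrors are tried first for Docker Hub images.
        "registry-mirrors": [f"http://{endpoint}"],
        # Required while the mirror has no TLS; entries are host:port without a scheme.
        "insecure-registries": [endpoint],
    }


if __name__ == "__main__":
    # Hypothetical internal hostname, for illustration only.
    print(json.dumps(build_daemon_config("registry-mirror.internal.example"), indent=2))
```

The printed JSON would go into `/etc/docker/daemon.json` on the agents, followed by a Docker Engine restart.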

Nice to have (but not mandatory) in the future:

  • A Helm chart managed by ourselves so we can tune what we want
    • Allow better tracking of registry version (even though we can do it through values for now)
    • Custom naming of resources (to avoid long names)
    • Avoid having to specify unused options
  • TLS, to avoid setting up "insecure registry" on Docker Engines. Might need to generate a custom certificate (to install inside agents at startup)

(old plan with ECR)

Same can be done on AWS with ECR: https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html

Moving ci.jenkins.io to AWS needs the same kind of setup:

  • The registry MUST be private (AWS PrivateLink) as it cannot use authentication through Docker Engines (might be solved using EC2 instance identity though)
  • Only DockerHub needs to be synced
  • We need it for VMs (with Docker CE). Nice to have for the EKS cluster (when autoscaling)
@dduportal
Contributor Author

First step: let's create the ECR pull through cache registry. A good source of knowledge to get started is the EKS Blueprint ECR Pattern with its Terraform source code in https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/ecr-pull-through-cache

It requires providing a username/token pair as input parameters to the Terraform project (through the pipeline). We'll start by creating the ECR before trying to access it from the private subnets (in agent VMs).

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Feb 7, 2025
@dduportal
Contributor Author

Update: we now have an ECR cache with pull through rules. Next step: we need to set up access from EC2 agents

@dduportal
Contributor Author

dduportal commented Feb 8, 2025

Damn, ECR only works with... custom image names. It's not transparent 🤦 : https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-working-pulling.html#:~:text=Quay-,Docker%20Hub,-GitHub%20Container%20Registry

For Docker Hub official images:

docker pull aws_account_id.dkr.ecr.region.amazonaws.com/docker-hub/library/image_name:tag

Also: https://repost.aws/questions/QUFyZ2PX-XSUOe6UQhDSpg3w/ecr-pull-through-cache-rule-how-should-i-adapt-my-ci#AN3-C1Ein8RC2Gi7rVMjbp2g

=> We could use it for the EKS cluster (for faster pulls) but it makes no sense for the ci.jenkins.io VM agents (ATH, Docker builds, etc.) as it would force users to have a different image name between CI and other environments (dev., CD, etc.).
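
To make the naming problem concrete, here is a rough sketch of the rewrite every image reference would need under ECR pull through cache. It follows the `docker-hub/library/image_name:tag` pattern quoted above. The function name, the account ID `123456789012` and the handling of short names are assumptions for illustration, and it only covers plain Docker Hub references (no registry host, no digest).

```python
def to_ecr_pull_through_ref(image: str, account_id: str, region: str,
                            prefix: str = "docker-hub") -> str:
    """Map a Docker Hub image reference to its ECR pull through cache name."""
    name, sep, tag = image.partition(":")
    if "/" not in name:
        # Docker Hub official images live under the implicit "library" namespace.
        name = f"library/{name}"
    if not sep:
        tag = "latest"
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{name}:{tag}"


if __name__ == "__main__":
    # Placeholder account ID and region, for illustration only.
    print(to_ecr_pull_through_ref("jenkins/jenkins:lts-jdk17", "123456789012", "us-east-2"))
    print(to_ecr_pull_through_ref("alpine", "123456789012", "us-east-2"))
```

Every Dockerfile `FROM` line, test fixture and pipeline would have to carry the rewritten name, which is the environment mismatch described above.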

I guess we have to run a registry in mirror mode instead: https://docs.docker.com/docker-hub/image-library/mirror/ or any other alternative

@timja
Member

timja commented Feb 8, 2025

It definitely doesn't work with registry-mirrors: [".../docker-hub"]? (I do see the AWS answer about it but there seems to be quite limited info on it out there.)

@dduportal
Contributor Author

> It definitely doesn't work with registry-mirrors: [".../docker-hub"]? (I do see the AWS answer about it but there seems to be quite limited info on it out there.)

Test in progress (had the same thought process and want to know if it works because it would be really useful)

@dduportal
Contributor Author

> It definitely doesn't work with registry-mirrors: [".../docker-hub"]? (I do see the AWS answer about it but there seems to be quite limited info on it out there.)

So it does not work: ECR requires authentication, which registry mirrors do not support:

Feb 09 09:08:15 ip-10-0-1-240 dockerd[417046]: time="2025-02-09T09:08:15.264365428Z" level=info msg="Attempting next endpoint for pull after error: Head \"https://326712726440.dkr.ecr.us-east-2.amazonaws.com/docker-hub/v2/jenkins/jenkins/manifests/lts-jdk17\": no basic auth credentials"
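
As a reading of the log line above (not of the engine's source code), the requested URL appears to be the configured mirror base followed by the registry API path `/v2/<name>/manifests/<tag>`. The sketch below rebuilds that URL. The function name is hypothetical, and the `library/` and `latest` defaults are assumptions about short Docker Hub references.

```python
def mirror_manifest_url(mirror_base: str, image: str) -> str:
    """Rebuild the manifest URL seen in the dockerd log for a given mirror and image."""
    name, sep, tag = image.partition(":")
    if "/" not in name:
        # Assumed default namespace for Docker Hub official images.
        name = f"library/{name}"
    if not sep:
        tag = "latest"
    return f"{mirror_base.rstrip('/')}/v2/{name}/manifests/{tag}"


if __name__ == "__main__":
    print(mirror_manifest_url(
        "https://326712726440.dkr.ecr.us-east-2.amazonaws.com/docker-hub",
        "jenkins/jenkins:lts-jdk17",
    ))
```

The error reported for that request is "no basic auth credentials", so the failure in this test is the missing authentication.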

@timja
Member

timja commented Feb 9, 2025

Ah damn, I guess needs a proxy in front

@dduportal
Contributor Author

> Ah damn, I guess needs a proxy in front

Yup, but given the additional setup, it's worth hosting a Docker registry in mirror mode in the EKS cluster instead and exposing it to agents in private subnets with the same method as ACP.

Just ran a quick test with https://github.com/twuni/docker-registry.helm/tree/main and it works nice and easy.
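
For reference, mirror mode in Distribution comes down to a `proxy` section in the registry configuration, and Distribution documents that configuration keys can be overridden with `REGISTRY_`-prefixed environment variables (upper-cased path, joined with underscores). The sketch below flattens a small mirror configuration into such variables. The helper name and the credential placeholders are assumptions, and how the Helm chart actually wires these settings is not shown here.

```python
def to_registry_env(config: dict, prefix: str = "REGISTRY") -> dict:
    """Flatten a nested Distribution config into REGISTRY_* environment variables."""
    env = {}
    for key, value in config.items():
        name = f"{prefix}_{key.upper()}"
        if isinstance(value, dict):
            # Recurse so that proxy.remoteurl becomes REGISTRY_PROXY_REMOTEURL.
            env.update(to_registry_env(value, name))
        else:
            env[name] = str(value)
    return env


# Minimal pull-through mirror settings; the credentials are placeholders.
mirror_config = {
    "proxy": {
        "remoteurl": "https://registry-1.docker.io",
        "username": "<dockerhub-user>",
        "password": "<read-only-token>",
    },
    "storage": {
        "filesystem": {"rootdirectory": "/var/lib/registry"},
        "delete": {"enabled": "true"},
    },
}

if __name__ == "__main__":
    for name, value in sorted(to_registry_env(mirror_config).items()):
        print(f"{name}={value}")
```

The `filesystem` storage driver is used here because the plan in the issue body rules out S3 for the proxy use case.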

@dduportal
Contributor Author

dduportal commented Feb 9, 2025

Update: the issue body has been updated to explain the choice of Docker registry.
I've tested https://spegel.dev/docs/getting-started/, which looks like an interesting tool, but I failed to get it working on EKS on the first try AND it needs an additional Service to be exposed through an internal AWS LB (so not immediate).

Task list:
