
feat: Add Option to Prevent Overlapping Replications of the Same Artifact in Harbor #21347

Open · wants to merge 12 commits into main

Conversation

bupd
Contributor

@bupd bupd commented Dec 22, 2024

Single Active Replication per replication policy

Proposal: goharbor/community#256

Summary

This PR addresses a long-standing issue where overlapping replications of the same artifact can occur in Harbor, leading to unnecessary resource consumption and poor performance. By introducing a "Disable parallel replication" checkbox in the replication policy, it ensures that replication tasks for the same policy do not run in parallel, preventing bandwidth overload and queue backups, especially for large artifacts.

Similar Issues

Related Issues

Why do we need this

  1. Users have long requested a way to limit each replication policy to a single execution at a time.
  2. It avoids overlapping replications, which are common with scheduled replications.
  3. Without it, replication executions start piling up.
  4. At some point the backlog becomes unmanageable, leading to a snowball effect on bandwidth and the job queue.
  5. Running replications for the same replication policy in parallel provides no benefit.
  6. At a minimum, there should be an option to restrict a policy to one execution at a time.
  7. Parallel replication is a serious problem for larger artifacts (more than 512 MB in size).

Changes Made

  • Added a "Single active replication" checkbox to the replication UI.
  • Updated the replication policy model.
  • Updated the worker to defer a new execution while one is already running for the same policy (see the sketch below).
  • Added a single_active_replication column to the database schema.
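
To make the worker change concrete, here is a minimal Go sketch. This is not the code in this PR; the `Policy` field, `ExecutionManager` interface, and `shouldStart` function are hypothetical stand-ins for Harbor's actual replication controller types.

```go
// Illustrative sketch only; names are hypothetical stand-ins, not Harbor's real APIs.
package replication

import (
	"context"
	"fmt"
)

// Policy mirrors the new checkbox added by this PR (hypothetical field name).
type Policy struct {
	ID                      int64
	SingleActiveReplication bool
}

// ExecutionManager is a minimal stand-in for whatever component lists executions.
type ExecutionManager interface {
	// CountRunning returns the number of executions in the Running state for a policy.
	CountRunning(ctx context.Context, policyID int64) (int64, error)
}

// shouldStart decides whether a new execution may be created for the policy.
func shouldStart(ctx context.Context, em ExecutionManager, p *Policy) (bool, error) {
	if !p.SingleActiveReplication {
		return true, nil // feature disabled: keep the current parallel behaviour
	}
	running, err := em.CountRunning(ctx, p.ID)
	if err != nil {
		return false, fmt.Errorf("checking running executions for policy %d: %w", p.ID, err)
	}
	// Defer this trigger if an execution is already in flight for the same policy.
	return running == 0, nil
}
```

If `shouldStart` returns false, the trigger is simply deferred rather than starting a second parallel execution for the same policy.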

Screenshots

image

Screenshot_2025-01-08_18-44-27

Observation on Replication Performance with and without Feature

Speed Test Results:

Upload Speed: 9.50 Mbps (Data Used: 4.3 MB)
Latency: 37.96 ms (Jitter: 1.76 ms, Min: 25.11 ms, Max: 43.32 ms)
Packet Loss: 0.0%
Result URL: Speedtest Result

Images to Replicate: Variations of docker.io/vad1mo/1gb-random-file

Workflow 1 (with Feature):

Images: 3 × 512 MB each (~1.5 GB total)
From: reg1.bupd.xyz to reg2.bupd.xyz
Replication: Normal
Started: 2:27 PM
Completed: 2:49 PM
Bandwidth Used: 1.42 GB
Theoretical Time: 22.5 minutes for 1.5 GB
Actual Time: 22 minutes (No packet loss or bandwidth issues)

Insight

Replication with the feature enabled completed within the expected timeframe, showing stable upload speed and no packet loss.
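
As a rough cross-check of the theoretical figure (assuming 1.5 GiB): 1.5 GiB ≈ 12.9 Gbit, and at the measured 9.5 Mbps upload speed that is about 1,360 s, i.e. roughly 22.6 minutes, which lines up with the stated 22.5-minute theoretical time and the observed 22 minutes.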

Workflow 2 (without Feature):

Started: 2:55 PM
Name: Destroyer
Bandwidth Used: 13+ GB
Result: Failed
Time Taken: ~4+ hours

Status of Instance (no longer functioning)

Screenshot_2025-01-08_19-25-59

Insight

The replication process failed without the feature enabled, highlighting that the feature is likely essential for successful image transfers between reg1.bupd.xyz and reg2.bupd.xyz. This indicates that the feature could be critical for ensuring stable uploads and completing transfers within the expected time.

Conclusion:

This observation suggests that this feature is necessary for reliable replication between registries. The data transfer rates and lack of packet loss when the feature is used make it an essential component for stable image replication.

Todo

  • Add Tests

Please indicate you've done the following:

  • Well Written Title and Summary of the PR
  • Label the PR as needed. "release-note/ignore-for-release, release-note/new-feature, release-note/update, release-note/enhancement, release-note/community, release-note/breaking-change, release-note/docs, release-note/infra, release-note/deprecation"
  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Made sure tests are passing and test coverage is added if needed.
  • Considered the docs impact and opened a new docs issue or PR with docs changes if needed in website repository.

Fix #19937

@bupd bupd added the release-note/new-feature New Harbor Feature label Dec 22, 2024

codecov bot commented Dec 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.18%. Comparing base (c8c11b4) to head (380d27f).
Report is 368 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #21347      +/-   ##
==========================================
+ Coverage   45.36%   46.18%   +0.81%     
==========================================
  Files         244      247       +3     
  Lines       13333    13883     +550     
  Branches     2719     2875     +156     
==========================================
+ Hits         6049     6412     +363     
- Misses       6983     7134     +151     
- Partials      301      337      +36     
| Flag | Coverage Δ |
|------|------------|
| unittests | 46.18% <ø> (+0.81%) ⬆️ |

Flags with carried forward coverage won't be shown.

see 491 files with indirect coverage changes

@bupd bupd force-pushed the feat/add-job-skipper branch from 6763a2c to a997836 Compare December 22, 2024 17:43
@bupd bupd marked this pull request as ready for review December 22, 2024 18:23
@bupd bupd requested a review from a team as a code owner December 22, 2024 18:23
@wy65701436
Contributor

Thanks, @bupd, for your contribution! I believe this is a valuable new feature for Harbor that should follow the proposal process.

There are several questions we need to briefly discuss:

  • What happens if the execution gets stuck in the 'running' status?
  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?
  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Resolved (outdated) review threads on: api/v2.0/swagger.yaml, src/portal/src/i18n/lang/en-us-lang.json (two threads)
@Vad1mo Vad1mo enabled auto-merge (squash) December 23, 2024 16:50
@bupd bupd changed the title Add Option to Skip Replication Policy Execution If an Execution is already running feat: Add Option to Prevent Overlapping Replications of the Same Artifact in Harbor Dec 23, 2024
@tpoxa
Contributor

tpoxa commented Dec 28, 2024

Thanks, @bupd, for your contribution! I believe this is a valuable new feature for Harbor that should follow the proposal process.

There are several questions we need to briefly discuss:

  • What happens if the execution gets stuck in the 'running' status?

We were planning to use JobServiceMonitorClient to get the current state of job observations. As long as an observation says the job is running, a new execution will be skipped. I believe the problem of executions getting stuck should be fixed separately. What are your thoughts on that?

  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?

Indeed, PolicyID was never needed before. We can add it to the job arguments anyway, so we can scan the observations, parse the job arguments, and match the policyID against the current one.

  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

I believe even checking an artifact for existence produces an additional HTTP request, so we are still adding some extra efficiency.
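
A minimal sketch of that observation-scanning approach, assuming a hypothetical monitor client interface and a policy_id job argument (these names are illustrative, not Harbor's actual jobservice API):

```go
// Illustrative sketch only; the monitor client and argument key are hypothetical.
package replication

import (
	"context"
	"encoding/json"
)

// observation is a minimal stand-in for a job-service monitoring record.
type observation struct {
	Status string          // e.g. "Running"
	Args   json.RawMessage // raw job arguments as submitted
}

// monitorClient is a stand-in for something like JobServiceMonitorClient.
type monitorClient interface {
	ListObservations(ctx context.Context) ([]observation, error)
}

// hasRunningExecution reports whether any running replication job carries the
// given policy ID in its arguments.
func hasRunningExecution(ctx context.Context, mc monitorClient, policyID int64) (bool, error) {
	obs, err := mc.ListObservations(ctx)
	if err != nil {
		return false, err
	}
	for _, o := range obs {
		if o.Status != "Running" {
			continue
		}
		var args struct {
			PolicyID int64 `json:"policy_id"` // hypothetical argument key
		}
		if err := json.Unmarshal(o.Args, &args); err != nil {
			continue // ignore jobs whose arguments we cannot parse
		}
		if args.PolicyID == policyID {
			return true, nil
		}
	}
	return false, nil
}
```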

@Vad1mo
Member

Vad1mo commented Dec 30, 2024

  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Regarding that statement: there is a fundamental deficit in Harbor replications. While a layer is in the process of being copied, a new replication will try to copy the same blob again. If you have a slow connection like DSL, the same blob will be copied over and over again. This results in a domino effect: each new replication halves the throughput, resulting in slower transfers, and because all replications run in parallel, the same thing happens with the next blob/layer.

We ran an experiment with:

  • Copying a 1 GiB image
  • Transfer speed limited to 1 Mbit per replication
    • Note that in the experiment we set 1 Mbit per replication. On a real DSL or 3G connection, the total bandwidth (1 Mbit) would instead be shared across all replications, so each replication would get 1/n of the bandwidth.
    • So the observed time needs to be multiplied by n.
  • Replication every minute
  • Replication time with "Prevent Overlapping Replications" enabled: 22 min
  • Replication time with "Prevent Overlapping Replications" disabled: failed after 4+ hours
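
To make the domino effect concrete, below is a rough, illustrative simulation (plain Go, not Harbor code). It assumes the figures from the earlier observation, a 9.5 Mbps uplink and ~1.5 GiB to copy per run, with a new replication starting every minute and all in-flight copies sharing the uplink evenly:

```go
// Rough, illustrative simulation of overlapping replications sharing one uplink.
// Parameters are assumptions borrowed from the observation above, not measured here.
package main

import "fmt"

func main() {
	const (
		uplinkMbps = 9.5             // measured upload speed in Mbps
		imageMbit  = 1.5 * 8 * 1024  // ~1.5 GiB expressed in Mbit
		startEvery = 60.0            // a new replication starts every 60 s
		step       = 1.0             // simulation step in seconds
		horizon    = 4 * 3600.0      // simulate 4 hours
	)

	var remaining []float64 // Mbit left per in-flight replication
	completed := 0
	nextStart := 0.0

	for t := 0.0; t < horizon; t += step {
		if t >= nextStart {
			remaining = append(remaining, imageMbit)
			nextStart += startEvery
		}
		if n := len(remaining); n > 0 {
			share := uplinkMbps / float64(n) // bandwidth is split evenly
			for i := range remaining {
				remaining[i] -= share * step
			}
			// drop finished transfers, keep the rest
			alive := remaining[:0]
			for _, r := range remaining {
				if r > 0 {
					alive = append(alive, r)
				} else {
					completed++
				}
			}
			remaining = alive
		}
	}
	fmt.Printf("after 4h of overlapping starts: %d replications finished, %d still in flight\n",
		completed, len(remaining))
}
```

Because new runs arrive faster than the shared uplink can drain them, the number of in-flight copies keeps growing and each one slows the others down, which is the snowball effect described above.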

Maksym Trofimenko and others added 5 commits January 8, 2025 06:08
Signed-off-by: bupd <[email protected]>
Co-authored-by: Maksym Trofimenko <[email protected]>
Signed-off-by: bupd <[email protected]>
Co-authored-by: Maksym Trofimenko <[email protected]>
* Adds Skip If Running to Replication Policy

Signed-off-by: bupd <[email protected]>
* Adds Checkbox in replication policy UI
* Updates swagger & sql schema

Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from a997836 to a901167 Compare January 8, 2025 00:41
* renamed all occurrences to single active replication
* updated description and tests

Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from a901167 to f50e1c2 Compare January 8, 2025 00:56
Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from fa151c0 to 9d9b400 Compare January 8, 2025 01:14
@bupd
Contributor Author

bupd commented Jan 8, 2025

Hello @wy65701436

Below are my observations.

  • What happens if the execution gets stuck in the 'running' status?

When "single active replication" is checked, replication executions per policy are limited to at most one. So if an execution is already running, other scheduled executions will be deferred.

  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?
  1. Each execution has the replication policyID attached to it. When starting new jobs, we check whether any jobs with the same policyID are already running; if any are in progress, we defer the execution until the previous job completes.
  2. By design, once a Harbor replication execution starts, the policy content is fixed; changes made to the replication policy only take effect in upcoming replication executions.
  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Yes, exactly: Harbor only skips artifacts that have already been successfully replicated.
So in the typical scenario Harbor will still end up copying the same layers, with different jobs pushing the same layer.

In the scenario below I tried replicating three images, each 512 MB in size, but the resulting project quota ended up looking like this.

Project Quota

The quota used should have been less than 1.5 GB.

Screenshot_2025-01-08_16-02-46
4.77 GB is used by the project here because replication jobs copied the same layer simultaneously.
This clearly shows that the quota consumed is greater than the size of the images being replicated.
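
As a rough sanity check on those numbers: three 512 MB images should consume at most ~1.5 GB of quota, so the observed 4.77 GB is a little over three times the expected size, consistent with several overlapping executions each pushing the same layers again.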

Below is the overall quota.

Attaching here for reference.

Screenshot_2025-01-08_16-13-42

Thanks

You can check the two instances used on this experiment: reg1.bupd.xyz & reg2.bupd.xyz

@bupd
Contributor Author

bupd commented Jan 8, 2025

Observation on Replication Performance with and without Feature

Speed Test Results:

Upload Speed: 9.50 Mbps (Data Used: 4.3 MB)
Latency: 37.96 ms (Jitter: 1.76 ms, Min: 25.11 ms, Max: 43.32 ms)
Packet Loss: 0.0%
Result URL: Speedtest Result

Images to Replicate: Variations of docker.io/vad1mo/1gb-random-file

Workflow 1 (with Feature):

Images: 3 × 512 MB each (~1.5 GB total)
From: reg1.bupd.xyz to reg2.bupd.xyz
Replication: Normal
Started: 2:27 PM
Completed: 2:49 PM
Bandwidth Used: 1.42 GB
Theoretical Time: 22.5 minutes for 1.5 GB
Actual Time: 22 minutes (No packet loss or bandwidth issues)

Insight

Replication with the feature enabled completed within the expected timeframe, showing stable upload speed and no packet loss.

Workflow 2 (without Feature):

Started: 2:55 PM
Name: Destroyer
Bandwidth Used: 13+ GB
Result: Failed
Time Taken: ~4+ hours (failed to complete)

Status of Instance

Screenshot_2025-01-08_19-25-59

Insight

The replication process failed without the feature enabled, highlighting that the feature is likely essential for successful image transfers between reg1.bupd.xyz and reg2.bupd.xyz. This indicates that the feature could be critical for ensuring stable uploads and completing transfers within the expected time.

Conclusion:

This observation suggests that this feature is necessary for reliable replication between registries. The data transfer rates and lack of packet loss when the feature is used make it an essential component for stable image replication.

@bupd
Contributor Author

bupd commented Jan 9, 2025

You could call this a workaround:

Set the scheduled replication interval to comfortably more than the time a single execution needs, to be safe.
-- or --
The cron string is 0 */2 * * * *, which means the replication starts every 2 minutes; you should always keep the schedule interval longer than the time a single job takes to complete.
#20532 (comment)
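
For illustration (the numbers here are assumptions, not from the linked comment): with executions that take around 22 minutes, this workaround would force a 6-field cron such as `0 0 * * * *` (hourly) instead of every couple of minutes, just to stay safely above the job completion time.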

The workaround of setting a longer replication interval, like once a day, fails to address the need for timely synchronization across registries. For users who rely on Harbor to maintain identical registries at different locations, frequent replication (e.g., every 5 minutes) is necessary to ensure minimal discrepancies between registries. By suggesting a longer interval, users may end up with outdated or inconsistent images, undermining the core functionality of replication.

Thanks.

checks policyID before creating tasks

Signed-off-by: Maksym Trofimenko <[email protected]>
auto-merge was automatically disabled January 20, 2025 00:15

Head branch was pushed to by a user without write access

Signed-off-by: Maksym Trofimenko <[email protected]>