
feat: Add Option to Prevent Overlapping Replications of the Same Artifact in Harbor #21347

Open · wants to merge 12 commits into main

Conversation

bupd
Contributor

@bupd bupd commented Dec 22, 2024

Single Active Replication per replication policy

Proposal: goharbor/community#256

Summary

This PR addresses a long-standing issue where overlapping replications of the same artifact can occur in Harbor, leading to unnecessary resource consumption and poor performance. By introducing a "Disable parallel replication" checkbox in the replication policy, it ensures that replication tasks for the same policy do not run in parallel, preventing bandwidth overload and queue backups, especially for large artifacts.

Similar Issues

Related Issues

Why do we need this

  1. Users have long requested a way to limit each replication policy to a single execution at a time.
  2. It avoids overlapping replications, which are common with scheduled replications.
  3. Without it, replication executions start piling up.
  4. At some point the backlog becomes unmanageable, leading to a snowball effect on bandwidth and the job queue.
  5. Running replications for the same replication policy in parallel provides no benefit.
  6. At a minimum, there should be an option to restrict a policy to one execution at a time.
  7. Parallel replication is a serious problem for larger artifacts (more than 512 MB in size).

Changes Made

  • Added a "Single active replication" checkbox to the replication UI.
  • Updated the replication policy model.
  • Updated the worker to defer a new execution while one is already running for the same policy (see the sketch below).
  • Added a single_active_replication column to the database schema.
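
To make the worker change concrete, here is a minimal Go sketch. This is not the code in this PR; the `Policy` field, `ExecutionManager` interface, and `shouldStart` function are hypothetical stand-ins for Harbor's actual replication controller types.

```go
// Illustrative sketch only; names are hypothetical stand-ins, not Harbor's real APIs.
package replication

import (
	"context"
	"fmt"
)

// Policy mirrors the new checkbox added by this PR (hypothetical field name).
type Policy struct {
	ID                      int64
	SingleActiveReplication bool
}

// ExecutionManager is a minimal stand-in for whatever component lists executions.
type ExecutionManager interface {
	// CountRunning returns the number of executions in the Running state for a policy.
	CountRunning(ctx context.Context, policyID int64) (int64, error)
}

// shouldStart decides whether a new execution may be created for the policy.
func shouldStart(ctx context.Context, em ExecutionManager, p *Policy) (bool, error) {
	if !p.SingleActiveReplication {
		return true, nil // feature disabled: keep the current parallel behaviour
	}
	running, err := em.CountRunning(ctx, p.ID)
	if err != nil {
		return false, fmt.Errorf("checking running executions for policy %d: %w", p.ID, err)
	}
	// Defer this trigger if an execution is already in flight for the same policy.
	return running == 0, nil
}
```

If `shouldStart` returns false, the trigger is simply deferred rather than starting a second parallel execution for the same policy.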

Screenshots

image

Screenshot_2025-01-08_18-44-27

Observation on Replication Performance with and without Feature

Speed Test Results:

Upload Speed: 9.50 Mbps (Data Used: 4.3 MB)
Latency: 37.96 ms (Jitter: 1.76 ms, Min: 25.11 ms, Max: 43.32 ms)
Packet Loss: 0.0%
Result URL: Speedtest Result

Images to Replicate: Variations of docker.io/vad1mo/1gb-random-file

Workflow 1 (with Feature):

Images: 3 × 512 MB each (~1.5 GB total)
From: reg1.bupd.xyz to reg2.bupd.xyz
Replication: Normal
Started: 2:27 PM
Completed: 2:49 PM
Bandwidth Used: 1.42 GB
Theoretical Time: 22.5 minutes for 1.5 GB
Actual Time: 22 minutes (No packet loss or bandwidth issues)

Insight

Replication with the feature enabled completed within the expected timeframe, showing stable upload speed and no packet loss.
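
As a rough cross-check of the theoretical figure (assuming 1.5 GiB): 1.5 GiB ≈ 12.9 Gbit, and at the measured 9.5 Mbps upload speed that is about 1,360 s, i.e. roughly 22.6 minutes, which lines up with the stated 22.5-minute theoretical time and the observed 22 minutes.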

Workflow 2 (without Feature):

Started: 2:55 PM
Name: Destroyer
Bandwidth Used: 13+ GB
Result: Failed
Time Taken: ~4+ hours

Status of Instance (no longer functioning)

Screenshot_2025-01-08_19-25-59

Insight

The replication process failed without the feature enabled, highlighting that the feature is likely essential for successful image transfers between reg1.bupd.xyz and reg2.bupd.xyz. This indicates that the feature could be critical for ensuring stable uploads and completing transfers within the expected time.

Conclusion:

This observation suggests that this feature is necessary for reliable replication between registries. The data transfer rates and lack of packet loss when the feature is used make it an essential component for stable image replication.

Todo

  • Add Tests

Please indicate you've done the following:

  • Well Written Title and Summary of the PR
  • Label the PR as needed. "release-note/ignore-for-release, release-note/new-feature, release-note/update, release-note/enhancement, release-note/community, release-note/breaking-change, release-note/docs, release-note/infra, release-note/deprecation"
  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Made sure tests are passing and test coverage is added if needed.
  • Considered the docs impact and opened a new docs issue or PR with docs changes if needed in website repository.

Fix #19937

@bupd bupd added the release-note/new-feature New Harbor Feature label Dec 22, 2024

codecov bot commented Dec 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.18%. Comparing base (c8c11b4) to head (380d27f).
Report is 368 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #21347      +/-   ##
==========================================
+ Coverage   45.36%   46.18%   +0.81%     
==========================================
  Files         244      247       +3     
  Lines       13333    13883     +550     
  Branches     2719     2875     +156     
==========================================
+ Hits         6049     6412     +363     
- Misses       6983     7134     +151     
- Partials      301      337      +36     
| Flag | Coverage Δ |
|------|------------|
| unittests | 46.18% <ø> (+0.81%) ⬆️ |

Flags with carried forward coverage won't be shown.

see 491 files with indirect coverage changes

@bupd bupd force-pushed the feat/add-job-skipper branch from 6763a2c to a997836 Compare December 22, 2024 17:43
@bupd bupd marked this pull request as ready for review December 22, 2024 18:23
@bupd bupd requested a review from a team as a code owner December 22, 2024 18:23
@wy65701436
Contributor

Thanks, @bupd, for your contribution! I believe this is a valuable new feature for Harbor that should follow the proposal process.

There are several questions we need to briefly discuss:

  • What happens if the execution gets stuck in the 'running' status?
  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?
  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Resolved (outdated) review threads on: api/v2.0/swagger.yaml, src/portal/src/i18n/lang/en-us-lang.json (two threads)
@Vad1mo Vad1mo enabled auto-merge (squash) December 23, 2024 16:50
@bupd bupd changed the title Add Option to Skip Replication Policy Execution If an Execution is already running feat: Add Option to Prevent Overlapping Replications of the Same Artifact in Harbor Dec 23, 2024
@tpoxa
Contributor

tpoxa commented Dec 28, 2024

Thanks, @bupd, for your contribution! I believe this is a valuable new feature for Harbor that should follow the proposal process.

There are several questions we need to briefly discuss:

  • What happens if the execution gets stuck in the 'running' status?

We were planning to use JobServiceMonitorClient to get the current state of job observations. As long as an observation says the job is running, a new execution will be skipped. I believe the problem of executions getting stuck should be fixed separately. What are your thoughts on that?

  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?

Indeed, PolicyID was never needed before. We can add it to the job arguments anyway, so we can scan the observations, parse the job arguments, and match the policyID against the current one.

  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

I believe even checking an artifact for existence produces an additional HTTP request, so we are still adding some extra efficiency.
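
A minimal sketch of that observation-scanning approach, assuming a hypothetical monitor client interface and a policy_id job argument (these names are illustrative, not Harbor's actual jobservice API):

```go
// Illustrative sketch only; the monitor client and argument key are hypothetical.
package replication

import (
	"context"
	"encoding/json"
)

// observation is a minimal stand-in for a job-service monitoring record.
type observation struct {
	Status string          // e.g. "Running"
	Args   json.RawMessage // raw job arguments as submitted
}

// monitorClient is a stand-in for something like JobServiceMonitorClient.
type monitorClient interface {
	ListObservations(ctx context.Context) ([]observation, error)
}

// hasRunningExecution reports whether any running replication job carries the
// given policy ID in its arguments.
func hasRunningExecution(ctx context.Context, mc monitorClient, policyID int64) (bool, error) {
	obs, err := mc.ListObservations(ctx)
	if err != nil {
		return false, err
	}
	for _, o := range obs {
		if o.Status != "Running" {
			continue
		}
		var args struct {
			PolicyID int64 `json:"policy_id"` // hypothetical argument key
		}
		if err := json.Unmarshal(o.Args, &args); err != nil {
			continue // ignore jobs whose arguments we cannot parse
		}
		if args.PolicyID == policyID {
			return true, nil
		}
	}
	return false, nil
}
```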

@Vad1mo
Member

Vad1mo commented Dec 30, 2024

  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Regarding that statement: there is a fundamental deficit in Harbor replications. While a layer is in the process of being copied, a new replication will try to copy the same blob again. If you have a slow connection like DSL, the same blob will be copied over and over again. This results in a domino effect: each new replication halves the throughput, resulting in slower transfers, and because all replications run in parallel, the same thing happens with the next blob/layer.

We ran an experiment with:

  • Copying a 1 GiB image
  • Transfer speed limited to 1 Mbit per replication
    • Note that in the experiment we set 1 Mbit per replication. On a real DSL or 3G connection, the total bandwidth (1 Mbit) would instead be shared across all replications, so each replication would get 1/n of the bandwidth.
    • So the observed time needs to be multiplied by n.
  • Replication every minute
  • Replication time with "Prevent Overlapping Replications" enabled: 22 min
  • Replication time with "Prevent Overlapping Replications" disabled: failed after 4+ hours
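
To make the domino effect concrete, below is a rough, illustrative simulation (plain Go, not Harbor code). It assumes the figures from the earlier observation, a 9.5 Mbps uplink and ~1.5 GiB to copy per run, with a new replication starting every minute and all in-flight copies sharing the uplink evenly:

```go
// Rough, illustrative simulation of overlapping replications sharing one uplink.
// Parameters are assumptions borrowed from the observation above, not measured here.
package main

import "fmt"

func main() {
	const (
		uplinkMbps = 9.5             // measured upload speed in Mbps
		imageMbit  = 1.5 * 8 * 1024  // ~1.5 GiB expressed in Mbit
		startEvery = 60.0            // a new replication starts every 60 s
		step       = 1.0             // simulation step in seconds
		horizon    = 4 * 3600.0      // simulate 4 hours
	)

	var remaining []float64 // Mbit left per in-flight replication
	completed := 0
	nextStart := 0.0

	for t := 0.0; t < horizon; t += step {
		if t >= nextStart {
			remaining = append(remaining, imageMbit)
			nextStart += startEvery
		}
		if n := len(remaining); n > 0 {
			share := uplinkMbps / float64(n) // bandwidth is split evenly
			for i := range remaining {
				remaining[i] -= share * step
			}
			// drop finished transfers, keep the rest
			alive := remaining[:0]
			for _, r := range remaining {
				if r > 0 {
					alive = append(alive, r)
				} else {
					completed++
				}
			}
			remaining = alive
		}
	}
	fmt.Printf("after 4h of overlapping starts: %d replications finished, %d still in flight\n",
		completed, len(remaining))
}
```

Because new runs arrive faster than the shared uplink can drain them, the number of in-flight copies keeps growing and each one slows the others down, which is the snowball effect described above.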

Maksym Trofimenko and others added 5 commits January 8, 2025 06:08
Signed-off-by: bupd <[email protected]>
Co-authored-by: Maksym Trofimenko <[email protected]>
Signed-off-by: bupd <[email protected]>
Co-authored-by: Maksym Trofimenko <[email protected]>
* Adds Skip If Running to Replication Policy

Signed-off-by: bupd <[email protected]>
* Adds Checkbox in replication policy UI
* Updates swagger & sql schema

Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from a997836 to a901167 Compare January 8, 2025 00:41
* renamed all occurrences to single active replication
* updated description and tests

Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from a901167 to f50e1c2 Compare January 8, 2025 00:56
Signed-off-by: bupd <[email protected]>
@bupd bupd force-pushed the feat/add-job-skipper branch from fa151c0 to 9d9b400 Compare January 8, 2025 01:14
@bupd
Contributor Author

bupd commented Jan 8, 2025

Hello @wy65701436

Below are my observations.

  • What happens if the execution gets stuck in the 'running' status?

When "single active replication" is checked, replication executions per policy are limited to at most one. So if an execution is already running, other scheduled executions will be deferred.

  • How do we determine the relationship between the replication policy and the execution instance — by ID? And what happens if the policy content is updated after the execution begins?
  1. Each execution has the replication policyID attached to it. When starting new jobs, we check whether any jobs with the same policyID are already running; if any are in progress, we defer the execution until the previous job completes.
  2. By design, once a Harbor replication execution starts, the policy content is fixed; changes made to the replication policy only take effect in upcoming replication executions.
  • During the replication job (at least in Harbor-to-Harbor replication), Harbor skips artifacts that have already been successfully replicated. So, we should consider how much additional benefit this feature will bring.

Yes, exactly: Harbor only skips artifacts that have already been successfully replicated.
So in the typical scenario Harbor will still end up copying the same layers, with different jobs pushing the same layer.

In the scenario below I tried replicating three images, each 512 MB in size, but the resulting project quota ended up looking like this.

Project Quota

The quota used should have been less than 1.5 GB.

Screenshot_2025-01-08_16-02-46
4.77 GB is used by the project here because replication jobs copied the same layer simultaneously.
This clearly shows that the quota consumed is greater than the size of the images being replicated.
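
As a rough sanity check on those numbers: three 512 MB images should consume at most ~1.5 GB of quota, so the observed 4.77 GB is a little over three times the expected size, consistent with several overlapping executions each pushing the same layers again.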

Below is the overall quota.

Attaching here for reference.

Screenshot_2025-01-08_16-13-42

Thanks

You can check the two instances used on this experiment: reg1.bupd.xyz & reg2.bupd.xyz

@bupd
Contributor Author

bupd commented Jan 8, 2025

Observation on Replication Performance with and without Feature

Speed Test Results:

Upload Speed: 9.50 Mbps (Data Used: 4.3 MB)
Latency: 37.96 ms (Jitter: 1.76 ms, Min: 25.11 ms, Max: 43.32 ms)
Packet Loss: 0.0%
Result URL: Speedtest Result

Images to Replicate: Variations of docker.io/vad1mo/1gb-random-file

Workflow 1 (with Feature):

Images: 3 × 512 MB each (~1.5 GB total)
From: reg1.bupd.xyz to reg2.bupd.xyz
Replication: Normal
Started: 2:27 PM
Completed: 2:49 PM
Bandwidth Used: 1.42 GB
Theoretical Time: 22.5 minutes for 1.5 GB
Actual Time: 22 minutes (No packet loss or bandwidth issues)

Insight

Replication with the feature enabled completed within the expected timeframe, showing stable upload speed and no packet loss.

Workflow 2 (without Feature):

Started: 2:55 PM
Name: Destroyer
Bandwidth Used: 13+ GB
Result: Failed
Time Taken: ~4+ hours (failed to complete)

Status of Instance

Screenshot_2025-01-08_19-25-59

Insight

The replication process failed without the feature enabled, highlighting that the feature is likely essential for successful image transfers between reg1.bupd.xyz and reg2.bupd.xyz. This indicates that the feature could be critical for ensuring stable uploads and completing transfers within the expected time.

Conclusion:

This observation suggests that this feature is necessary for reliable replication between registries. The data transfer rates and lack of packet loss when the feature is used make it an essential component for stable image replication.

@bupd
Contributor Author

bupd commented Jan 9, 2025

You could call this a workaround:

Set the scheduled replication interval to comfortably more than the time a single execution needs, to be safe.
-- or --
The cron string is 0 */2 * * * *, which means the replication starts every 2 minutes; you should always keep the schedule interval longer than the time a single job takes to complete.
#20532 (comment)
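
For illustration (the numbers here are assumptions, not from the linked comment): with executions that take around 22 minutes, this workaround would force a 6-field cron such as `0 0 * * * *` (hourly) instead of every couple of minutes, just to stay safely above the job completion time.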

The workaround of setting a longer replication interval, like once a day, fails to address the need for timely synchronization across registries. For users who rely on Harbor to maintain identical registries at different locations, frequent replication (e.g., every 5 minutes) is necessary to ensure minimal discrepancies between registries. By suggesting a longer interval, users may end up with outdated or inconsistent images, undermining the core functionality of replication.

Thanks.

checks policyID before creating tasks

Signed-off-by: Maksym Trofimenko <[email protected]>
auto-merge was automatically disabled January 20, 2025 00:15

Head branch was pushed to by a user without write access

Signed-off-by: Maksym Trofimenko <[email protected]>