Add P2P distributed optimization to advanced examples #3189

Open · francescofarina wants to merge 17 commits into main

Conversation

francescofarina

Description

This PR adds a new set of advanced examples in examples/advanced/distributed_optimization, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.
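For context, "P2P distributed optimization" here means each client exchanges values only with its neighbors and combines them locally, with no central aggregator. The snippet below is a minimal, illustrative NumPy sketch of one such algorithm (decentralized gradient descent on simple quadratic objectives over a ring topology); it is not the code added in this PR, and all names and numbers in it are invented for illustration.

# Illustrative sketch, not the PR's code: decentralized gradient descent (DGD).
# Each peer mixes its neighbors' iterates via a doubly-stochastic matrix W and
# then takes a gradient step on its own local objective f_i(x) = 0.5*||x - t_i||^2.
import numpy as np

n_peers, dim, step_size = 4, 3, 0.1
targets = np.random.randn(n_peers, dim)     # t_i defining each local objective
x = np.zeros((n_peers, dim))                # one local iterate per peer (rows)

# Mixing weights for a ring topology: each peer averages itself and its two neighbors.
W = np.zeros((n_peers, n_peers))
for i in range(n_peers):
    W[i, i] = 0.5
    W[i, (i - 1) % n_peers] = 0.25
    W[i, (i + 1) % n_peers] = 0.25

for t in range(200):
    mixed = W @ x                           # consensus step: combine neighbors' values
    grads = mixed - targets                 # gradient of each local quadratic
    x = mixed - step_size * grads           # local descent step

# With a constant step size, each peer ends up in a small neighborhood of the
# global minimizer (the mean of the targets), and the peers' average matches it.
print("max disagreement:", np.abs(x - x.mean(axis=0)).max())
print("avg-iterate error:", np.abs(x.mean(axis=0) - targets.mean(axis=0)).max())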

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

@francescofarina (Author)

@chesterxgchen I implemented your suggested changes:

  • Moved the implementation to app_opt and renamed the module from nvdo to p2p, since the classes can potentially be used for arbitrary P2P algorithms, not just distributed optimization. Happy to change the name, though.
  • Added documentation to everything that's now in app_opt.
  • Moved the SyncAlgorithmExecutor to a separate file and renamed the base.py files to base_p2p_executor.py for the executor and p2p_controller.py for the controller. The BaseP2PAlgorithmExecutor is now an ABC (see the sketch after this list).
  • Removed pickle. For convenience, all the executors currently save their results with torch.save, but that could be removed and easily reimplemented by the user if needed.
  • The dependencies in examples/advanced/distributed_optimization are now specified in a requirements.txt, as the core code has been moved to app_opt and can be imported directly from nvflare.
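For illustration, here is a minimal sketch of the "base executor as an ABC" structure described above. The class and method names are hypothetical stand-ins, not the actual nvflare app_opt API.

# Hypothetical sketch only; names are stand-ins, not the actual nvflare API.
from abc import ABC, abstractmethod

class BaseP2PExecutorSketch(ABC):
    """Abstract base: holds neighbor bookkeeping, defers the algorithm to subclasses."""

    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.neighbors_values = {}  # iteration -> {sender: value}

    @abstractmethod
    def run_algorithm(self, iterations: int):
        """Concrete executors implement the P2P algorithm loop here."""

class SyncExecutorSketch(BaseP2PExecutorSketch):
    def run_algorithm(self, iterations: int):
        for t in range(iterations):
            values = self.neighbors_values.pop(t, {})  # consume this iteration's values
            # ...combine `values` with the local state, then send updates to neighbors...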

Let me know what you think.

Now that it's moved to the core, I feel the implementation could be changed/improved by offloading to the user things like saving results, storing losses via callbacks, monitoring, etc. Perhaps it makes more sense to do that at a later stage, though.

@holgerroth (Collaborator)

Tested locally and runs fine. The tutorial is great! Should consider adding some CI testing in a future PR.

# Store the received value in the neighbors_values dictionary
self.neighbors_values[iteration][sender] = self._from_message(data["value"])
# Check if all neighbor values have been received for the iteration
if len(self.neighbors_values[iteration]) >= len(self.neighbors):
Collaborator:

Do we reset neighbors_values once we have all of them for the next round?

Author:

We need to maintain values per iteration and delete them after they're consumed by the algorithm. More details are in examples/advanced/distributed_optimization/README.md:

neighbors_values needs to maintain a dictionary of received values per iteration. This is because different parts of the network may be at different iterations of the algorithm (at most one apart), which means I could receive a message from a neighbor valid for iteration t+1 while I'm still at iteration t. Since that message won't be sent again, I need to store it. To avoid neighbors_values growing indefinitely, its content for iteration t is deleted after the values have been consumed and the algorithm has moved to the next iteration. We'll see that in the next section.
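As a concrete illustration of this buffering scheme, here is a sketch of one possible shape of the logic (not the PR's actual implementation; all names below are invented for illustration):

# Sketch of per-iteration buffering with cleanup; names are illustrative.
from collections import defaultdict
import threading

class NeighborBuffer:
    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.neighbors_values = defaultdict(dict)     # iteration -> {sender: value}
        self._complete = defaultdict(threading.Event)

    def store(self, iteration, sender, value):
        # A value for iteration t+1 may arrive while we are still at iteration t,
        # so it is kept under its own iteration key instead of being dropped.
        self.neighbors_values[iteration][sender] = value
        if len(self.neighbors_values[iteration]) >= len(self.neighbors):
            self._complete[iteration].set()

    def consume(self, iteration, timeout=None):
        # Block until every neighbor's value for this iteration has arrived.
        if not self._complete[iteration].wait(timeout):
            raise TimeoutError(f"missing neighbor values for iteration {iteration}")
        self._complete.pop(iteration)
        # Delete the buffer for this iteration so the dict doesn't grow indefinitely.
        return self.neighbors_values.pop(iteration)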

@chesterxgchen (Collaborator) left a comment:

add a few more comments

@francescofarina (Author)

add a few more comments

Implemented the renaming of the distributed optimization controllers and executors and added a synchronization timeout as a parameter. I also re-ran all the examples on GPU and they work just fine.
