Add P2P distributed optimization to advanced examples #3189

Open · francescofarina wants to merge 17 commits into main

Conversation

francescofarina

Description

This PR adds a new set of advanced examples in examples/advanced/distributed_optimization, showing how to use the lower-level APIs to build P2P distributed optimization algorithms.
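For context, "P2P distributed optimization" here means each client exchanges values only with its neighbors and combines them locally, with no central aggregator. The snippet below is a minimal, illustrative NumPy sketch of one such algorithm (decentralized gradient descent on simple quadratic objectives over a ring topology); it is not the code added in this PR, and all names and numbers in it are invented for illustration.

# Illustrative sketch, not the PR's code: decentralized gradient descent (DGD).
# Each peer mixes its neighbors' iterates via a doubly-stochastic matrix W and
# then takes a gradient step on its own local objective f_i(x) = 0.5*||x - t_i||^2.
import numpy as np

n_peers, dim, step_size = 4, 3, 0.1
targets = np.random.randn(n_peers, dim)     # t_i defining each local objective
x = np.zeros((n_peers, dim))                # one local iterate per peer (rows)

# Mixing weights for a ring topology: each peer averages itself and its two neighbors.
W = np.zeros((n_peers, n_peers))
for i in range(n_peers):
    W[i, i] = 0.5
    W[i, (i - 1) % n_peers] = 0.25
    W[i, (i + 1) % n_peers] = 0.25

for t in range(200):
    mixed = W @ x                           # consensus step: combine neighbors' values
    grads = mixed - targets                 # gradient of each local quadratic
    x = mixed - step_size * grads           # local descent step

# With a constant step size, each peer ends up in a small neighborhood of the
# global minimizer (the mean of the targets), and the peers' average matches it.
print("max disagreement:", np.abs(x - x.mean(axis=0)).max())
print("avg-iterate error:", np.abs(x.mean(axis=0) - targets.mean(axis=0)).max())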

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

@francescofarina (Author)

@chesterxgchen I implemented your suggested changes:

  • Moved the implementation to app_opt and renamed the module from nvdo to p2p, since the classes can potentially be used for arbitrary P2P algorithms, not just distributed optimization. Happy to change the name, though.
  • Added documentation to everything that's now in app_opt.
  • Moved the SyncAlgorithmExecutor to a separate file and renamed the base.py files to base_p2p_executor.py for the executor and p2p_controller.py for the controller. The BaseP2PAlgorithmExecutor is now an ABC (see the sketch after this list).
  • Removed pickle. For convenience, all the executors currently save their results with torch.save, but that could be removed and easily reimplemented by the user if needed.
  • The dependencies in examples/advanced/distributed_optimization are now specified in a requirements.txt, as the core code has been moved to app_opt and can be imported directly from nvflare.
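For illustration, here is a minimal sketch of the "base executor as an ABC" structure described above. The class and method names are hypothetical stand-ins, not the actual nvflare app_opt API.

# Hypothetical sketch only; names are stand-ins, not the actual nvflare API.
from abc import ABC, abstractmethod

class BaseP2PExecutorSketch(ABC):
    """Abstract base: holds neighbor bookkeeping, defers the algorithm to subclasses."""

    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.neighbors_values = {}  # iteration -> {sender: value}

    @abstractmethod
    def run_algorithm(self, iterations: int):
        """Concrete executors implement the P2P algorithm loop here."""

class SyncExecutorSketch(BaseP2PExecutorSketch):
    def run_algorithm(self, iterations: int):
        for t in range(iterations):
            values = self.neighbors_values.pop(t, {})  # consume this iteration's values
            # ...combine `values` with the local state, then send updates to neighbors...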

Let me know what you think.

Now that it's moved to the core, I feel the implementation could be changed/improved by offloading to the user things like saving results, storing losses via callbacks, monitoring, etc. Perhaps it makes more sense to do that at a later stage, though.

@holgerroth (Collaborator)

Tested locally and runs fine. The tutorial is great! Should consider adding some CI testing in a future PR.

# Store the received value in the neighbors_values dictionary
self.neighbors_values[iteration][sender] = self._from_message(data["value"])
# Check if all neighbor values have been received for the iteration
if len(self.neighbors_values[iteration]) >= len(self.neighbors):
Collaborator:

Do we reset neighbors_values once we have all of them for the next round?

Author:

We need to maintain values per iteration and delete them after they're consumed by the algorithm. More details are in examples/advanced/distributed_optimization/README.md:

neighbors_values needs to maintain a dictionary of received values per iteration. This is because different parts of the network may be at different iterations of the algorithm (at most one apart), which means I could receive a message from a neighbor valid for iteration t+1 while I'm still at iteration t. Since that message won't be sent again, I need to store it. To avoid neighbors_values growing indefinitely, its content for iteration t is deleted after the values have been consumed and the algorithm has moved to the next iteration. We'll see that in the next section.
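As a concrete illustration of this buffering scheme, here is a sketch of one possible shape of the logic (not the PR's actual implementation; all names below are invented for illustration):

# Sketch of per-iteration buffering with cleanup; names are illustrative.
from collections import defaultdict
import threading

class NeighborBuffer:
    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.neighbors_values = defaultdict(dict)     # iteration -> {sender: value}
        self._complete = defaultdict(threading.Event)

    def store(self, iteration, sender, value):
        # A value for iteration t+1 may arrive while we are still at iteration t,
        # so it is kept under its own iteration key instead of being dropped.
        self.neighbors_values[iteration][sender] = value
        if len(self.neighbors_values[iteration]) >= len(self.neighbors):
            self._complete[iteration].set()

    def consume(self, iteration, timeout=None):
        # Block until every neighbor's value for this iteration has arrived.
        if not self._complete[iteration].wait(timeout):
            raise TimeoutError(f"missing neighbor values for iteration {iteration}")
        self._complete.pop(iteration)
        # Delete the buffer for this iteration so the dict doesn't grow indefinitely.
        return self.neighbors_values.pop(iteration)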

@chesterxgchen (Collaborator) left a comment:

add a few more comments

@francescofarina (Author)

add a few more comments

Implemented the renaming of the distributed optimization controllers and executors and added a synchronization timeout as a parameter. I also re-ran all the examples on GPU and they work just fine.
