Skip to content

Commit

Permalink
Update base for Update on "[WIP][RFC] TorchFT integration"
Browse files Browse the repository at this point in the history
**Summary**
This is a WIP TorchFT integration PR.

**Current Issues**

This doesn't work at this moment as there are hanged groups when a new group joins. 

**Issue 1:**
~Group 0 and group 1 will hang during the first `should_commit` after group 1 applying the pending state_dict from group 0.~

Fixed with: pytorch/torchft#83

**Issue 2:**
~Group 0 and group 1 will pass the `should_commit` but group 0 needs healing which is wrong and the healing process will cause another hang.~

Fixed with: pytorch/torchft#83

**Issue 3:**
The byproduct of issue 1 and issue 2: group 1 will continue to print out
```
[rank0]:devgpu051:76838:80357 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer devgpu051.cln3.svc.fbinfra.net<33618>
```

***How to reproduce?***
Using the following the steps in `Reproduce steps` to run 2 groups. Then kill any of the group after both start training. Remember to apply pytorch/torchft#83.

**Issue 4:**
When there are 3 groups, everyone requests the state dict every step.

***How to reproduce?***
Using the `Reproduce steps` to run 2 groups, then add another group by modifying the command. 

**Reproduce steps:**

1. Patch TorchFT with pytorch/torchft#82
2. Execute lighthouse
3. Execute the following command in one terminal:
```
TORCHFT_MANAGER_PORT=29520 REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=0,1 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=0
```
4. Wait 10 seconds, execute following command in another terminal:
```
TORCHFT_MANAGER_PORT=29522 REPLICA_GROUP_ID=1 CUDA_VISIBLE_DEVICES=2,3 NGPU=2 ./run_llama_train.sh --training.data_parallel_shard_degree=2 --experimental.enable_torchft --experimental.ft_replica_group_id=1
```



[ghstack-poisoned]
  • Loading branch information
fegin committed Jan 31, 2025
1 parent 16ecc46 commit f7ae033
Showing 0 changed files with 0 additions and 0 deletions.

0 comments on commit f7ae033

Please sign in to comment.