DeepSpeed no-SSH multi-node multi-GPU training in a container environment #6685

Open
1 task done
Justin-12138 opened this issue Jan 17, 2025 · 1 comment
Labels
bug Something isn't working pending This problem is yet to be addressed

Comments

@Justin-12138

Reminder

  • I have read the above rules and searched the existing issues.

System Info

llamafactory==0.9.1.dev0
transformers==4.46.1
deepspeed==0.15.4

Reproduction

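I would like to ask about pseudo multi-node multi-GPU training with DeepSpeed in a container environment.
1: When I train with passwordless SSH set up, the final results are saved on the node listed in the first line of the hostfile, for example:
node3   slots=1
node4    slots=1
The DeepSpeed script then has to be run on that first node.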
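
For reference, the hostfile layout above can be read with a few lines of Python. This is only a sketch of the `hostname slots=N` format, not DeepSpeed's own parser; `parse_hostfile` is a hypothetical helper name introduced for illustration.

```python
from collections import OrderedDict


def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into an ordered host -> slots mapping.

    Hypothetical helper for illustration; not DeepSpeed's internal parser.
    """
    hosts = OrderedDict()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        host, slots = line.split()
        key, value = slots.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field in hostfile line: {raw!r}")
        hosts[host] = int(value)
    return hosts


hostfile_text = """
node3   slots=1
node4    slots=1
"""
hosts = parse_hostfile(hostfile_text)
first_node = next(iter(hosts))    # first hostfile entry, where the script is launched
world_size = sum(hosts.values())  # total number of GPU slots across nodes
```

With the two-line hostfile from this report, `first_node` is `node3` and `world_size` is 2.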

The final results are saved as shown below, and no problems occurred during training:
Image
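I now want to do the same thing in a k8s environment. How should I go about it? From the official DeepSpeed documentation, it seems that the launch script has to be run on every node: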

deepspeed --hostfile=myhostfile --no_ssh --node_rank=<n> \
    --master_addr=<addr> --master_port=<port> \
    <client_entry.py> <client args> \
    --deepspeed --deepspeed_config ds_config.json
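
In k8s, one possible way to fill in `<n>` and `<addr>` is to derive them inside each pod. The sketch below assumes a StatefulSet whose pods are named `<name>-<ordinal>`, so the ordinal can serve as the node rank, and a headless service that gives pod 0 a stable DNS name. The names `trainer-1`, `trainer-0.trainer`, `train.py` and port 29500 are placeholders, not taken from this issue. The code only assembles the command line from the flags shown above; it does not launch anything.

```python
import shlex


def node_rank_from_hostname(hostname):
    """Return the trailing ordinal of a StatefulSet-style pod name, e.g. 'trainer-1' -> 1."""
    ordinal = hostname.rsplit("-", 1)[-1]
    if not ordinal.isdigit():
        raise ValueError(f"hostname {hostname!r} has no trailing ordinal")
    return int(ordinal)


def build_no_ssh_command(hostname, master_addr, master_port, hostfile, entry, entry_args=()):
    """Assemble the per-node `deepspeed --no_ssh` command as an argument list."""
    return [
        "deepspeed",
        f"--hostfile={hostfile}",
        "--no_ssh",
        f"--node_rank={node_rank_from_hostname(hostname)}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        entry,
        *entry_args,
        "--deepspeed",
        "--deepspeed_config",
        "ds_config.json",
    ]


cmd = build_no_ssh_command(
    hostname="trainer-1",             # placeholder; inside a pod, use socket.gethostname()
    master_addr="trainer-0.trainer",  # placeholder DNS name of the rank-0 pod
    master_port=29500,                # placeholder port
    hostfile="myhostfile",
    entry="train.py",                 # placeholder for <client_entry.py>
)
print(shlex.join(cmd))
```

Each pod would run its own copy of the resulting command, with a unique node rank and the same master address and port on every pod.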

Others

No response

@Justin-12138 Justin-12138 added bug Something isn't working pending This problem is yet to be addressed labels Jan 17, 2025
@mikerain

Same question. Could you give an example of using DeepSpeed with mpirun on k8s?
