llamafactory==0.9.1.dev0 transformers==4.46.1 deepspeed==0.15.4
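I'd like to ask about pseudo multi-node, multi-GPU training with deepspeed in a container environment. 1: In my current setup, after configuring passwordless SSH, the final result is saved on the node listed on the first line of the hostfile (for example, with a hostfile containing node3 slots=1 and node4 slots=1, that is node3), and the deepspeed script has to be run on that first node.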
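The final results were saved correctly, and training ran without any problems. I now want to do the same thing in a k8s environment. How should I set this up? From the official deepspeed documentation, it seems the launch script has to be run on every node: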
deepspeed --hostfile=myhostfile --no_ssh --node_rank=<n> \
    --master_addr=<addr> --master_port=<port> \
    <client_entry.py> <client args> \
    --deepspeed --deepspeed_config ds_config.json
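For anyone trying the `--no_ssh` route on k8s, here is a rough sketch of how each pod could build its own copy of the command above. It is not an official recipe. It assumes the pods come from a StatefulSet, so each hostname ends in an ordinal (`<name>-0`, `<name>-1`, ...) that can double as `--node_rank`. The names `trainer-1`, `trainer-0.trainer`, `train.py`, and port `29500` are hypothetical placeholders, and the helper functions are made up for illustration; only the deepspeed flags come from the command quoted above.

```python
import re
import shlex


def node_rank_from_hostname(hostname: str) -> int:
    """Return the trailing ordinal of a StatefulSet-style pod name.

    Assumption: pods are named '<statefulset-name>-<ordinal>', e.g. 'trainer-1'.
    """
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing ordinal in hostname: {hostname!r}")
    return int(match.group(1))


def build_launch_command(hostname, master_addr, master_port, hostfile, entry, entry_args):
    """Assemble the per-node deepspeed command as an argv list.

    Only flags that appear in the quoted command are used:
    --hostfile, --no_ssh, --node_rank, --master_addr, --master_port.
    """
    rank = node_rank_from_hostname(hostname)
    return [
        "deepspeed",
        f"--hostfile={hostfile}",
        "--no_ssh",
        f"--node_rank={rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        entry,
        *entry_args,
    ]


if __name__ == "__main__":
    # Hypothetical values; in a real pod the hostname would come from
    # socket.gethostname() and the master address from a headless Service.
    cmd = build_launch_command(
        hostname="trainer-1",
        master_addr="trainer-0.trainer",
        master_port=29500,
        hostfile="myhostfile",
        entry="train.py",
        entry_args=["--deepspeed", "--deepspeed_config", "ds_config.json"],
    )
    print(shlex.join(cmd))
```

Every pod would run the same script and differ only in the rank derived from its hostname, so no pod needs SSH access to any other. The resulting list can be passed to `subprocess.run(cmd, check=True)` as the container's entrypoint.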
Same question here. Could you provide an example of using mpirun with deepspeed on k8s?