概述
关注:决策智能与机器学习,深耕AI脱水干货
分布式TensorFlow 神经网络训练基准测试参考 驱动、内核软件、训练框架和集群通信软件准备 网络、服务器和容器平台配置 通过NCCL和Horovod集群通信框架,分布式运行集群训练任务 https://docs.mellanox.com/pages/releaseview.action?pageId=15049828 https://docs.mellanox.com/pages/releaseview.action?pageId=15049840
更多 AI Benchmark Reference Deployment Guide
▪ TensorFlow solutions on https://community.mellanox.com/s/topic/0TO50000000g1umGAA/tensorflow
▪ Reference Deployment Guide for RDMA over Ethernet (RoCE) accelerated TensorFlow 1.6 with an NVIDIA GPU Card over Mellanox 100 GbE Network https://community.mellanox.com/s/article/reference-deployment-guide-for-rdma-over-ethernet-roce--accelerated-tensorflow-1-6-with-an-nvidia-gpu-card-over-mellanox-100-gbe-network
▪ RDG for distributed, dockerised, RDMA accelerated Horovod training framework on HPE Apollo 6500 servers and 100Gb InfiniBand fabric https://docs.mellanox.com/pages/releaseview.action?pageId=15049840 ▪ How To build and run RDMA / RoCE accelerated Horovod framework Docker https://docs.mellanox.com/pages/releaseview.action?pageId=15049724 ▪ RDG for Accelerated ML and HPC Applications on K8s Cluster over Ethernet with RoCE https://docs.mellanox.com/pages/releaseview.action?pageId=15049828
原版文档下载链接, (HPC + GPU DRMA + AI DIST)
公众号内回复:0506
交流合作
请加微信号:yan_kylin_phenix,注明姓名+单位+从业方向+地点,非诚勿扰。
最后
以上就是敏感狗为你收集整理的分布式AI训练软件栈&硬件栈技术详解的全部内容,希望文章能够帮你解决分布式AI训练软件栈&硬件栈技术详解所遇到的程序开发问题。
如果觉得靠谱客网站的内容还不错,欢迎将靠谱客网站推荐给程序员好友。
发表评论 取消回复