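1. Every GPU node must have the NVIDIA driver installed. Download the installer from the URL below and run it with sh
wget https://cn.download.nvidia.com/tesla/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
sh NVIDIA-Linux-x86_64-450.80.02.run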
2. Verify that the driver installed successfully
nvidia-smi
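Before moving on to the container runtime steps, it can help to confirm that the driver tools are actually on PATH. The following is a minimal sketch rather than part of the original procedure; the helper name require_tool is hypothetical.

```shell
#!/bin/sh
# require_tool NAME: succeed only if NAME resolves to a command on PATH.
# (Hypothetical helper, for illustration only.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example use on a GPU node (commented out so the sketch has no side effects):
# require_tool nvidia-smi && nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```

If the helper reports nvidia-smi as missing, the driver installer most likely did not finish, and the later steps will not work.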
3. Install the nvidia-docker2 tooling: configure the yum repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Refresh the yum cache
yum clean all && yum makecache
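The distribution variable above is built by sourcing /etc/os-release and concatenating ID and VERSION_ID, so on CentOS 7 it evaluates to centos7 and selects the matching repo file. A small sketch of the same logic as a function, so it can be tried against any os-release style file; the function name repo_distribution is hypothetical.

```shell
# repo_distribution FILE: print ID followed by VERSION_ID from an os-release
# style file, the same way the distribution variable is built.
# The subshell keeps the sourced variables out of the caller's environment.
# Pass an absolute path; a bare file name would be looked up on PATH by ".".
repo_distribution() {
    ( . "$1" && echo "$ID$VERSION_ID" )
}

# Example: repo_distribution /etc/os-release
```

Checking this value first makes it easier to spot a repo URL that does not exist for the node's distribution.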
4. Install the nvidia-docker2 package and reload the Docker daemon configuration
# Back up daemon.json first so that existing settings are not lost
cp -a /etc/docker/daemon.json /etc/docker/daemon.json.bak
yum install -y nvidia-docker2
pkill -SIGHUP dockerd
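On a node where /etc/docker/daemon.json has never been created, the cp in the backup step fails because there is nothing to copy. A guarded variant might look like the sketch below; the function name backup_if_present is hypothetical.

```shell
# backup_if_present FILE: copy FILE to FILE.bak (preserving attributes) when
# it exists; report and succeed when there is nothing to back up.
backup_if_present() {
    if [ -f "$1" ]; then
        cp -a "$1" "$1.bak"
    else
        echo "no existing $1, nothing to back up"
    fi
}

# Example: backup_if_present /etc/docker/daemon.json
```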
5. Configure nvidia-container-runtime as the default Docker runtime
vim /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
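After editing, a quick text check can confirm that the default-runtime key really is set to nvidia. This is a minimal sketch that uses grep rather than a JSON parser, so it only recognizes the usual one-key-per-line layout; the function name has_nvidia_default is hypothetical.

```shell
# has_nvidia_default FILE: succeed if FILE sets "default-runtime" to "nvidia".
# Plain text match, not a JSON parser; it does not validate the file.
has_nvidia_default() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Example:
# has_nvidia_default /etc/docker/daemon.json && echo "default runtime is nvidia"
```

Once Docker has been restarted, the output of docker info also lists the default runtime, which is the more reliable confirmation.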
6. Restart the Docker service
systemctl daemon-reload
systemctl restart docker
7. Deploy the NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
8. Test scheduling a GPU workload on Kubernetes
cat gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-pod
spec:
  containers:
    - name: tf-container
      image: tensorflow/tensorflow:latest-gpu
      command: [ "/bin/sh" ]
      args: [ "-c", "while true; do echo hello; sleep 100;done" ]
      resources:
        limits:
          nvidia.com/gpu: 1 # request 1 GPU
9. Run nvidia-smi inside the pod; if it prints the GPU table, the setup works
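Steps 8 and 9 can be sketched as one function: create the pod from gpu.yaml, wait for it to become Ready, then run nvidia-smi in it. The function name run_gpu_smoke_test, the KUBECTL override variable, and the 300 second timeout are assumptions made for illustration; the tensorflow image is large, so the first pull may need longer.

```shell
# KUBECTL can be overridden (for example with a wrapper script); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# run_gpu_smoke_test MANIFEST POD: create the pod, wait until it is Ready,
# then run nvidia-smi inside it. Stops at the first command that fails.
run_gpu_smoke_test() {
    "$KUBECTL" apply -f "$1" &&
    "$KUBECTL" wait --for=condition=Ready "pod/$2" --timeout=300s &&
    "$KUBECTL" exec "$2" -- nvidia-smi
}

# Example: run_gpu_smoke_test gpu.yaml tf-pod
```

A pod that stays Pending usually means the scheduler found no node advertising nvidia.com/gpu, which the next step checks.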
10. Check whether the node advertises GPU resources
kubectl describe nodes/work1 |grep nv
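For scripting, the grep above can be replaced by a filter that extracts just the GPU count from the describe output. A minimal sketch, assuming the usual Capacity layout in which the resource appears as a line of the form "nvidia.com/gpu:  N"; the function name gpu_count is hypothetical.

```shell
# gpu_count: read `kubectl describe node` output on stdin and print the first
# nvidia.com/gpu value (the Capacity entry); print 0 if the resource is absent.
gpu_count() {
    awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
         END { if (!found) print 0 }'
}

# Example: kubectl describe nodes/work1 | gpu_count
```

A result of 0 on a GPU node suggests that the device plugin pod is not running there or that Docker is not using the nvidia runtime by default.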