Configuring Kubernetes (K8S) to Use GPU Node Resources

Published: 2023-09-20 15:47:10  Author: MhaiM

1. The GPU node must have the NVIDIA driver installed. Download the installer from the address below and run it with sh.

wget https://cn.download.nvidia.com/tesla/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
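The text above gives only the download command. As a minimal sketch of the "run it with sh" step, the helper below refuses to continue when the installer file is missing and otherwise runs it with sh. The function name install_driver is hypothetical and not part of the original post. The installer normally needs root privileges, plus kernel headers and a compiler matching the running kernel; this sketch does not check for those.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): run the NVIDIA .run
# installer only if the downloaded file is actually present.
install_driver() {
    local run_file="${1:-NVIDIA-Linux-x86_64-450.80.02.run}"
    if [ ! -f "$run_file" ]; then
        echo "installer not found: $run_file" >&2
        return 1
    fi
    # The .run file is a self-extracting shell archive, so sh can run it.
    sh "$run_file"
}
```

Called without arguments, install_driver looks in the current directory for the file fetched by the wget command.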

2. Check that the installation succeeded

nvidia-smi
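Beyond eyeballing the nvidia-smi table, the check can be scripted. The sketch below reads the driver version through nvidia-smi's --query-gpu option and compares it with a minimum. The helper names version_ge and check_driver are hypothetical, and the default minimum of 450.80.02 is simply the version downloaded in step 1, not a stated requirement.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to B.
# Relies on the version sort of sort -V.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver [MIN]: hypothetical helper that reads the installed driver
# version from nvidia-smi and compares it with MIN.
check_driver() {
    local min="${1:-450.80.02}" installed
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the driver is probably not installed" >&2
        return 2
    fi
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    echo "installed driver: $installed"
    version_ge "$installed" "$min"
}
```

Running check_driver on the GPU node exits with status 0 when the installed driver is at least the given version.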

3. Install the nvidia-docker 2.0 tooling: configure the yum repository

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Refresh the cache

yum clean all && yum makecache
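The distribution variable in this step is built by sourcing /etc/os-release and concatenating ID and VERSION_ID; on CentOS 7, where ID="centos" and VERSION_ID="7", it evaluates to centos7, which selects the matching repo file in the URL. The sketch below isolates that logic in a small function so the value can be inspected before fetching the repo file. The function name repo_distribution and its optional file argument are assumptions of this sketch.

```shell
#!/usr/bin/env bash
# repo_distribution [FILE]: print ID followed by VERSION_ID from an
# os-release style file (default: /etc/os-release).
repo_distribution() {
    local file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak out.
    ( . "$file" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

Running repo_distribution with no argument on the GPU node shows the string that would be substituted into the repository URL.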

4. Install the nvidia-docker2 package and reload the Docker daemon configuration

# Back up daemon.json first so that the existing configuration is not lost

cp -a /etc/docker/daemon.json  /etc/docker/daemon.json.bak

yum install -y nvidia-docker2

pkill -SIGHUP dockerd
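One caveat with the backup command: cp fails when /etc/docker/daemon.json does not exist, which is possible on a host where Docker still runs with its defaults. A minimal sketch of a more tolerant backup follows; the function name backup_if_present is hypothetical.

```shell
#!/usr/bin/env bash
# backup_if_present FILE: copy FILE to FILE.bak, preserving attributes,
# but only when FILE exists. Hypothetical helper, not from the original post.
backup_if_present() {
    local src="$1"
    if [ -f "$src" ]; then
        cp -a "$src" "$src.bak"
        echo "backed up $src to $src.bak"
    else
        echo "no existing $src, nothing to back up"
    fi
}
```

For this guide the call would be backup_if_present /etc/docker/daemon.json, run before installing nvidia-docker2.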

5. Configure nvidia-container-runtime as the Docker runtime

vim /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
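Docker will not start if daemon.json is not valid JSON, so it may be worth checking the file after editing and before restarting the service. The sketch below is one way to do that; the function name validate_json is hypothetical, and it assumes that either python3 or jq is installed on the node.

```shell
#!/usr/bin/env bash
# validate_json FILE: succeed only if FILE parses as JSON.
# Tries Python's json.tool module first and falls back to jq.
validate_json() {
    local file="$1"
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$file" >/dev/null 2>&1
    elif command -v jq >/dev/null 2>&1; then
        jq . "$file" >/dev/null 2>&1
    else
        echo "neither python3 nor jq is available" >&2
        return 2
    fi
}
```

A typical call would be validate_json /etc/docker/daemon.json && echo "daemon.json is valid".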

6. Restart the Docker service

systemctl daemon-reload

systemctl restart docker
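After the restart, docker info reports the active default runtime on its "Default Runtime:" line, which should now read nvidia. As a small sketch, the filter below extracts that value from docker info output supplied on standard input; the function name default_runtime is hypothetical.

```shell
#!/usr/bin/env bash
# default_runtime: read `docker info` text on stdin and print the value
# of its "Default Runtime:" line.
default_runtime() {
    awk -F': *' '/^[[:space:]]*Default Runtime:/ { print $2; exit }'
}
```

On the GPU node it would be used as docker info 2>/dev/null | default_runtime.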

7. Deploy the NVIDIA device plugin

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
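Once the manifest is applied, the plugin should be running on every GPU node before any GPU workload is scheduled. The sketch below scans pod listing output for plugin pods and succeeds only if at least one exists and all of them are in the Running state. It assumes the pods carry nvidia-device-plugin in their names and live in the kube-system namespace; both are assumptions about this manifest rather than statements from the original post, and the function name plugin_pods_running is hypothetical.

```shell
#!/usr/bin/env bash
# plugin_pods_running: read `kubectl get pods` output on stdin and succeed
# only if device plugin pods are listed and every one of them is Running.
# Column 3 of the default kubectl output is STATUS.
plugin_pods_running() {
    awk '/nvidia-device-plugin/ { seen = 1; if ($3 != "Running") bad = 1 }
         END { exit (seen && !bad) ? 0 : 1 }'
}
```

It would be used as kubectl get pods -n kube-system | plugin_pods_running && echo "device plugin is up".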

8. Test scheduling a GPU job on K8S

cat gpu.yaml

apiVersion: v1
kind: Pod
metadata:
  name: tf-pod
spec:
  containers:
    - name: tf-container
      image: tensorflow/tensorflow:latest-gpu
      command: [ "/bin/sh" ]
      args: [ "-c", "while true; do echo hello; sleep 100; done" ]
      resources:
        limits:
          nvidia.com/gpu: 1 # request 1 GPU

9. Enter the pod and run nvidia-smi. If it prints output, the setup succeeded.
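The post does not spell out the commands for creating the pod and running nvidia-smi inside it. A minimal sketch of steps 8 and 9 follows, using kubectl apply, kubectl wait, and kubectl exec. The function name run_gpu_check is hypothetical, and the 600 second timeout is an arbitrary assumption that may need adjusting while the large TensorFlow image is pulled.

```shell
#!/usr/bin/env bash
# run_gpu_check [MANIFEST] [POD]: hypothetical helper covering steps 8 and 9.
run_gpu_check() {
    local manifest="${1:-gpu.yaml}" pod="${2:-tf-pod}"
    # Create the pod from the manifest.
    kubectl apply -f "$manifest" || return 1
    # Block until the pod reports Ready (timeout value is an assumption).
    kubectl wait --for=condition=Ready "pod/$pod" --timeout=600s || return 1
    # Run nvidia-smi inside the container; a GPU table means success.
    kubectl exec "$pod" -- nvidia-smi
}
```

With the manifest saved as gpu.yaml in the current directory, calling run_gpu_check with no arguments uses the file and pod name from step 8.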

10. Check whether the node exposes GPU resources

kubectl describe nodes/work1 | grep nv
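Grepping the describe output works for a quick look. For scripting, the allocatable GPU count can also be read directly with a jsonpath query, where the dot in the resource name nvidia.com/gpu has to be escaped. The sketch below assumes the node name work1 used above; the function name node_gpu_count is hypothetical.

```shell
#!/usr/bin/env bash
# node_gpu_count [NODE]: print the number of allocatable nvidia.com/gpu
# resources on NODE. Prints nothing if the node advertises no GPUs.
node_gpu_count() {
    local node="${1:-work1}"
    kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

An empty result usually suggests that the device plugin has not registered the GPUs on that node yet.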