Problem Description
For a project we need to run GPU workloads on Kubernetes, preferably on NVIDIA A100 cards. After a straightforward deployment following the official documentation (GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes), we ran into a problem: after a pod has been running for a while its GPUs disappear, and inside the container nvidia-smi fails with "Failed to initialize NVML: Unknown Error". After deleting and recreating the container, nvidia-smi works at first, but after roughly ten seconds the same error reappears.
Problem Analysis
Several people have reported the same issue on GitHub, for example:
nvidia-smi command in container returns “Failed to initialize NVML: Unknown Error” after couple of times · Issue #1678 · NVIDIA/nvidia-docker · GitHub
“Failed to initialize NVML: Unknown Error” after random amount of time · Issue #1671 · NVIDIA/nvidia-docker · GitHub
From those discussions it is clear that our symptoms match theirs: the command fails because, after some time, the GPU device entries disappear from devices.list (path: /sys/fs/cgroup/devices/devices.list).
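A quick way to confirm this is to inspect the device cgroup allow-list from inside an affected container. This is only a diagnostic sketch, assuming the cgroup v1 layout from the path above and that the NVIDIA character devices use their usual major number 195:

```bash
# List the device rules this container is currently allowed to access.
cat /sys/fs/cgroup/devices/devices.list

# On a healthy container this includes the NVIDIA devices, e.g.
#   c 195:0 rw    (/dev/nvidia0)
#   c 195:255 rw  (/dev/nvidiactl)
# Once these entries vanish, nvidia-smi starts failing with the NVML error.
grep 'c 195:' /sys/fs/cgroup/devices/devices.list \
  || echo "NVIDIA device rules missing"
```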
The trigger is the kubelet CPU manager policy being set to static. Changing the policy to none does make the problem go away, so if you have no hard requirement on the CPU manager policy, you can stop there. In our case the policy has to stay static, so we kept digging and ended up at the following GitHub issue:
Updating cpu-manager-policy=static causes NVML unknown error · Issue #966 · NVIDIA/nvidia-docker · GitHub
The underlying cause is explained in https://zhuanlan.zhihu.com/p/344561710: in short, with the static policy the kubelet periodically reconciles container cgroups, and each update rewrites the devices cgroup without the GPU device nodes that the NVIDIA runtime injected behind its back, so the GPU entries disappear.
In https://github.com/NVIDIA/nvidia-docker/issues/966#issuecomment-610928514 the author describes a workaround, and the plugin has shipped the corresponding option for several releases: pass --pass-device-specs=true when deploying the official plugin. Re-reading the official deployment documentation, the flag is indeed documented there. After redeploying with it, however, the problem persisted. Going back to the discussion, we found that there is also a constraint on the runc version (https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432): ours was 1.1.4, and after downgrading runc the problem was finally resolved.
Solution Steps
1. Check the runc version. If it is already lower than 1.1.3, you can skip straight to step 3:
```bash
# runc -v
runc version 1.1.4
commit: v1.1.4-0-xxxxx
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.3
```

2. Update the runc version:
- Download the runc release you want to roll back to; this article uses v1.1.2 (https://github.com/opencontainers/runc/releases/tag/v1.1.2).
- Upload the downloaded runc.amd64 file to the server, rename it, and make it executable:

```bash
mv runc.amd64 runc && chmod +x runc
```

- Back up the existing runc:

```bash
mv /usr/bin/runc /home/runcbak
```

- Stop docker:

```bash
systemctl stop docker
```

- Replace runc with the new binary:

```bash
cp runc /usr/bin/runc
```

- Start docker:

```bash
systemctl start docker
```

- Check that the runc replacement succeeded:

```bash
# runc -v
runc version 1.1.2
```
3. Install the NVIDIA GPU plugin

Create plugin.yml. Compared with a plain deployment, the key difference is the PASS_DEVICE_SPECS environment variable:
```yaml
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          - name: PASS_DEVICE_SPECS
            value: "true"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

Create the plugin:
```bash
$ kubectl create -f plugin.yml
```
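Before creating a test pod, it is worth confirming that the plugin pod is up and that the node now advertises GPU resources. A quick check (the label selector comes from the plugin.yml above; nvidia.com/gpu is the resource name the plugin registers):

```bash
# The device plugin DaemonSet pod should be Running on every GPU node.
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide

# Each GPU node should now report nvidia.com/gpu under Capacity/Allocatable.
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```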
- Create a GPU pod and verify (a minimal sketch follows).
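The original post does not include the test pod itself, so the manifest below is only an illustrative sketch: the pod name, the CUDA image tag, and the single-GPU request are assumptions, not taken from the actual deployment. It requests one nvidia.com/gpu and then sleeps, so nvidia-smi can be re-checked well beyond the ten-second window in which the error used to appear:

```bash
# Hypothetical test pod; adjust the image and GPU count to your environment.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Check immediately, then again after a few minutes; with PASS_DEVICE_SPECS
# and the downgraded runc, the GPU should no longer disappear.
kubectl exec gpu-smoke-test -- nvidia-smi
```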
Appendix

Switching the CPU manager policy
- Stop the kubelet:

```bash
systemctl stop kubelet
```

- Delete the CPU manager checkpoint file:

```bash
rm /var/lib/kubelet/cpu_manager_state
```

- Modify config.yaml to change the cpuManagerPolicy (see the sketch after this list).
- Start the kubelet:

```bash
systemctl start kubelet
```
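As a hedged sketch of the config.yaml step, assuming the kubelet configuration file lives at /var/lib/kubelet/config.yaml (the path varies by distribution): the change is the cpuManagerPolicy field, and the checkpoint file is deleted first because the kubelet will typically refuse to start the CPU manager if the configured policy differs from the one recorded in cpu_manager_state.

```bash
# Assumed config path; adjust for your setup. Switch the policy in place,
# e.g. from "static" to "none" (swap the values to go the other way).
sed -i 's/^cpuManagerPolicy: static$/cpuManagerPolicy: none/' /var/lib/kubelet/config.yaml

# Confirm the resulting setting before restarting the kubelet.
grep cpuManagerPolicy /var/lib/kubelet/config.yaml
```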
Changing the containerd version
https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1238644201
Reference: https://blog.csdn.net/Ivan_Wz/article/details/111932120

- Download the containerd binary release from GitHub (https://github.com/containerd/containerd/releases/tag/v1.6.16).
- Extract containerd:

```bash
tar -zxvf containerd-1.6.16-linux-amd64.tar.gz
```
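The official release tarball unpacks into a bin/ directory containing containerd and its shims; the replacement step below assumes you run it from the extraction directory:

```bash
# Sketch: inspect what was extracted (the exact file list may vary by release).
ls bin/
# e.g. containerd  containerd-shim  containerd-shim-runc-v1  containerd-shim-runc-v2  ctr
```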
- Check the current containerd version:

```bash
docker info
```
- Stop docker:

```bash
systemctl stop docker
```
- Replace the containerd binary (the tarball extracts it under bin/):

```bash
cp bin/containerd /usr/bin/containerd
```
- Restart docker and check that the containerd version was replaced successfully.
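A small verification sketch for this last step:

```bash
systemctl start docker

# containerd reports its own version directly:
containerd --version

# docker info lists the containerd build it is talking to (Server section):
docker info | grep -i 'containerd version'
```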