
Resolving "Failed to initialize NVML: Unknown Error" with NVIDIA A100 GPUs in Kubernetes Pods

2023/09/21

Problem Description

A project required running GPU workloads on Kubernetes, preferably on NVIDIA A100 cards. After a straightforward deployment following the official documentation (GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes), pods would lose access to the GPU after running for a while: inside the container, nvidia-smi failed with "Failed to initialize NVML: Unknown Error". After deleting and recreating the container, nvidia-smi initially worked, but after roughly 10 seconds the same error reappeared.
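For illustration, this is what the failure looks like from outside the container once it sets in (gpu-test is a placeholder pod name):

$ kubectl exec -it gpu-test -- nvidia-smi
Failed to initialize NVML: Unknown Error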

Problem Analysis

Several people have reported the same problem on GitHub, for example:

nvidia-smi command in container returns “Failed to initialize NVML: Unknown Error” after couple of times · Issue #1678 · NVIDIA/nvidia-docker · GitHub

“Failed to initialize NVML: Unknown Error” after random amount of time · Issue #1671 · NVIDIA/nvidia-docker · GitHub

The discussions show that our symptoms match those reports: the command fails because, after some time, the GPU devices disappear from the container's devices.list (path: /sys/fs/cgroup/devices/devices.list).
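One way to confirm this is to inspect the device cgroup allowlist of an affected container; NVIDIA character devices use major number 195 (this check assumes cgroup v1, as in the issue discussions):

# run inside an affected container; NVIDIA devices use major number 195
grep 195 /sys/fs/cgroup/devices/devices.list
# healthy: entries such as "c 195:0 rw" are listed
# broken:  the command prints nothing, matching the NVML error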

The trigger is the kubelet CPU manager policy being set to static. Changing the policy to none does make the problem go away, and if you have no strict requirement on the CPU manager policy, you can stop there. We do require the static policy, however, so we traced the problem further to the following GitHub issue.

Updating cpu-manager-policy=static causes NVML unknown error · Issue #966 · NVIDIA/nvidia-docker · GitHub

The underlying cause is explained in https://zhuanlan.zhihu.com/p/344561710.

In https://github.com/NVIDIA/nvidia-docker/issues/966#issuecomment-610928514 the author describes a workaround, and an official fix has been available for several plugin releases: deploy the device plugin with the --pass-device-specs=true option. Re-reading the official deployment documentation, the parameter is indeed documented there. After redeploying with it, however, the problem persisted. Further reading of the discussion revealed a restriction on the runc version (https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432); ours was 1.1.4, and after downgrading runc the problem was resolved.
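For reference, the option can be supplied either as a command-line argument to the plugin binary or through the PASS_DEVICE_SPECS environment variable (the form used in plugin.yml below). A fragment of the container spec using the argument form might look like this (sketch only; the env-var form below is what we actually deployed):

    containers:
    - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
      name: nvidia-device-plugin-ctr
      # equivalent to setting PASS_DEVICE_SPECS="true" via env
      args: ["--pass-device-specs=true"]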

Resolution Steps

  1. Check the runc version; if it is lower than 1.1.3, skip straight to step 3:

    # runc -v
    runc version 1.1.4
    commit: v1.1.4-0-xxxxx
    spec: 1.0.2-dev
    go: go1.17.10
    libseccomp: 2.5.3
  2. Update (downgrade) the runc version:
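  • Download the new runc binary (v1.1.2 here) from the runc GitHub releases page; the URL below is the amd64 asset, adjust for your architecture:
wget https://github.com/opencontainers/runc/releases/download/v1.1.2/runc.amd64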

mv runc.amd64 runc && chmod +x runc
  • Back up the existing runc
mv /usr/bin/runc /home/runcbak
  • Stop Docker
systemctl stop docker
  • Install the new runc binary
cp runc /usr/bin/runc
  • Start Docker
systemctl start docker
  • Check that the runc replacement succeeded
runc -v
runc version 1.1.2
commit: v1.1.2-0-ga916309f
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.3
  3. Install the NVIDIA GPU device plugin
  • Create plugin.yml; the main difference from a standard deployment is the PASS_DEVICE_SPECS environment variable

    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          containers:
          - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
            name: nvidia-device-plugin-ctr
            env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: PASS_DEVICE_SPECS
              value: "true"
            securityContext:
              privileged: true
            volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          volumes:
          - name: device-plugin
            hostPath:
              path: /var/lib/kubelet/device-plugins
  • Create the plugin

$ kubectl create -f plugin.yml
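Once applied, it is worth checking that the plugin pod is running and that the node advertises nvidia.com/gpu resources, for example (the node name is a placeholder):

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
kubectl describe node <gpu-node> | grep -i 'nvidia.com/gpu'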
  4. Create a GPU pod and verify
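The manifest below is a minimal test pod for this purpose (the pod name and CUDA image tag are placeholders; any image with nvidia-smi available will do). It runs nvidia-smi in a loop, so with the fix in place the command should keep succeeding well past the ~10 seconds after which it previously started failing:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu20.04
        command: ["bash", "-c", "while true; do nvidia-smi; sleep 30; done"]
        resources:
          limits:
            nvidia.com/gpu: 1

Save it as e.g. gpu-test.yml, create it with kubectl create -f gpu-test.yml, and follow the output with kubectl logs -f gpu-test.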


Switching the CPU Manager Policy

  1. Stop kubelet
systemctl stop kubelet
  2. Delete cpu_manager_state
rm /var/lib/kubelet/cpu_manager_state
  3. Edit config.yaml
/var/lib/kubelet/config.yaml

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local

# CPU manager policy: none or static
cpuManagerPolicy: static

cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
  TopologyManager: true
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging: {}
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
podPidsLimit: 4096
reservedSystemCPUs: 0,1
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
tlsCipherSuites:
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
tlsMinVersion: VersionTLS12
topologyManagerPolicy: best-effort
volumeStatsAggPeriod: 0s
  4. Start kubelet

systemctl start kubelet
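Once kubelet is back up, the regenerated state file should reflect the new policy; the file is JSON and should look roughly like this (values elided):

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"...","checksum":...}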

Changing the containerd Version

https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1238644201

Reference: https://blog.csdn.net/Ivan_Wz/article/details/111932120

  1. Download the containerd binary release from GitHub (https://github.com/containerd/containerd/releases/tag/v1.6.16)
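For example, the amd64 archive can be fetched directly (adjust the version and architecture as needed):

wget https://github.com/containerd/containerd/releases/download/v1.6.16/containerd-1.6.16-linux-amd64.tar.gz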

  2. Extract containerd (the archive unpacks the binaries into a bin/ directory; run the copy commands in step 5 from inside it)

tar -zxvf containerd-1.6.16-linux-amd64.tar.gz

  3. Check the current containerd version
docker info 
containerd -v
  4. Stop Docker
systemctl stop docker

  5. Replace the containerd binaries

cp containerd /usr/bin/containerd
cp containerd-shim /usr/bin/containerd-shim
cp containerd-shim-runc-v1 /usr/bin/containerd-shim-runc-v1
cp containerd-shim-runc-v2 /usr/bin/containerd-shim-runc-v2
cp ctr /usr/bin/ctr

  6. Restart Docker and check that the containerd version was replaced successfully
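For example (docker info reports the containerd version the daemon is using):

systemctl start docker
containerd -v
docker info | grep -i containerd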
