Notes on the Tesla M40 compute card

Published: 2024-01-02 15:37:16  Author: azureology

SEO

Check motherboard temperatures with sensors
Control PWM fan speed on Linux
Read the NVIDIA GPU core temperature

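Background

My model training needs 20 GB of VRAM, and I wanted a local debug environment before moving to the cloud.
Mainstream RTX cards are too expensive, while the Tesla M40 24G compute card is cheap and good value; its lower CUDA core count is not a problem here.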

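Installation

As with a regular graphics card, installing it in a PCIe x16 slot is enough. The card has no display output, so the CPU needs integrated graphics to drive the monitor.
Once the dual 8-pin power is connected, the card is detected normally and the NVIDIA driver can be installed.
Official Drivers | NVIDIA
(The process is the same as for RTX-series drivers, with one extra choice: the CUDA version.)
If the card is not detected, try enabling the Above 4G Decoding option in the motherboard BIOS.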

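Cooling

Tesla cards are designed as data-center compute cards that rely on chassis airflow, so they have no fan of their own and you have to add active cooling yourself.
I went with a 3D-printed air duct fitted with two high-speed server fans (16000 RPM) under PWM speed control.
Temperature-related information: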

# info from nvidia-smi -q -a
Temperature
    GPU Current Temp                  : 48 C
    GPU T.Limit Temp                  : N/A
    GPU Shutdown Temp                 : 92 C
    GPU Slowdown Temp                 : 89 C
    GPU Max Operating Temp            : N/A
    GPU Target Temperature            : 87 C
    Memory Current Temp               : N/A
    Memory Max Operating Temp         : N/A

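According to this output, it is best to keep the GPU below 87 °C (the target temperature); above 89 °C it starts to throttle, and 92 °C is the shutdown threshold.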
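The thresholds above can be turned into a small check. This is only a sketch: the function name gpu_temp_state and the TARGET_C, SLOWDOWN_C and SHUTDOWN_C variables are made up for illustration, the defaults are this card's reported limits, and treating a reading exactly equal to a limit as having reached it is an assumption.

```shell
#!/bin/sh
# Limits taken from the nvidia-smi output above; override them through the
# (hypothetical) environment variables if your card reports other values.
TARGET_C=${TARGET_C:-87}
SLOWDOWN_C=${SLOWDOWN_C:-89}
SHUTDOWN_C=${SHUTDOWN_C:-92}

# Classify a temperature given in whole degrees C.
# Prints one of: ok, hot, slowdown, shutdown, invalid.
gpu_temp_state() {
    case $1 in
        ''|*[!0-9]*) echo "invalid"; return 1 ;;   # not a plain integer
    esac
    if [ "$1" -ge "$SHUTDOWN_C" ]; then
        echo "shutdown"
    elif [ "$1" -ge "$SLOWDOWN_C" ]; then
        echo "slowdown"
    elif [ "$1" -ge "$TARGET_C" ]; then
        echo "hot"
    else
        echo "ok"
    fi
}
```

For the 48 C reading shown above, gpu_temp_state 48 prints ok.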

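Temperature control

Temperature

The card has no fan and therefore cannot regulate fan speed itself, so the GPU temperature has to be read manually and the matching fan speed computed from it.
The GPU temperature can be read with a single command: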

nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader

If that feels too slow (about 12 ms per call), you can query the temperature directly through the NVML library instead.
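When the nvidia-smi reading feeds a fan controller, it helps to validate it first, because a failed query prints an error message instead of a number. The sketch below converts a reading to millidegrees, the unit used by the kernel's hwmon temp*_input files. The function names to_millidegrees and read_gpu_millidegrees are hypothetical, and taking only the first output line assumes the M40 is the first (or only) NVIDIA GPU.

```shell
#!/bin/sh
# Convert one reading in whole degrees C to millidegrees.
# Prints nothing and returns 1 if the reading is not a plain integer.
to_millidegrees() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo $(( $1 * 1000 ))
}

# Query nvidia-smi and convert the result.
# The query prints one line per GPU, so only the first line is used.
# A failed query yields non-numeric text, which to_millidegrees rejects.
read_gpu_millidegrees() {
    raw=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader | head -n 1)
    to_millidegrees "$raw"
}
```

Callers can then branch on the exit status, for example running the fans at full speed whenever read_gpu_millidegrees fails.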

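Fan speed

Whether fan speed can be read depends mainly on how well the motherboard chipset is supported. Most consumer motherboards either cannot expose fan speed directly on Linux or only ship Windows control software. On my ASUS Z370-P II, the QFAN temperature curve can only be adjusted in the BIOS, and the GPU fans are plugged into CHASSIS FAN, which is tied to the chassis temperature (the wrong sensor for this purpose), so manual control is the only option.
After installing lm-sensors and fancontrol, pwmconfig could not detect any fans: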

/usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed

Following a suggestion on Stack Overflow, I added a kernel parameter by hand; after a reboot the PWM device was detected.

sudo sed -E -i 's/(GRUB_CMDLINE_LINUX_DEFAULT=.+)"$/\1 acpi_enforce_resources=lax"/' /etc/default/grub
sudo update-grub
sudo reboot
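The sed command above appends the parameter every time it runs. A guarded variant is sketched below; the function name add_lax_param is made up for illustration, and the leading ^ (so that commented-out lines are skipped) is my addition to the original expression. It takes the file as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Append acpi_enforce_resources=lax to GRUB_CMDLINE_LINUX_DEFAULT in the
# given file, unless the parameter is already present on that line.
# Uses GNU sed's in-place editing, as the original command does.
add_lax_param() {
    grub_file=$1
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*acpi_enforce_resources=lax' "$grub_file"; then
        return 0    # already present, nothing to do
    fi
    sed -E -i 's/^(GRUB_CMDLINE_LINUX_DEFAULT=.+)"$/\1 acpi_enforce_resources=lax"/' "$grub_file"
}
```

A cautious way to use it is to copy /etc/default/grub to a scratch file, run add_lax_param on the copy, and diff the two before touching the real file. After the reboot, grep acpi_enforce_resources /proc/cmdline shows whether the kernel actually received the parameter.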

Detection result (sensors output):

nct6793-isa-0290
Adapter: ISA adapter
in0:                      632.00 mV (min =  +0.00 V, max =  +1.74 V)
in1:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                        3.41 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                        3.38 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                      160.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                      864.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                        3.41 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                        3.14 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                        1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                     864.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                     864.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                       1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                       1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                     416.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                     1121 RPM  (min =    0 RPM)
fan2:                     1053 RPM  (min =    0 RPM)
fan3:                     4141 RPM  (min =    0 RPM)
fan5:                     1330 RPM  (min =    0 RPM)
fan6:                        0 RPM  (min =    0 RPM)
SYSTIN:                    +33.0°C  (high = +98.0°C, hyst = +95.0°C)  sensor = thermistor
CPUTIN:                    +41.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                   +33.5°C    sensor = thermistor
AUXTIN1:                   +33.0°C    sensor = thermistor
AUXTIN2:                   +33.0°C    sensor = thermistor
AUXTIN3:                   +89.0°C    sensor = thermistor
PECI Agent 0:              +66.0°C  (high = +98.0°C, hyst = +95.0°C)
                                    (crit = +100.0°C)
PECI Agent 0 Calibration:  +54.0°C  
PCH_CHIP_CPU_MAX_TEMP:      +0.0°C  
PCH_CHIP_TEMP:              +0.0°C  
intrusion0:               ALARM
intrusion1:               ALARM
beep_enable:              disabled
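Most of that output is noise when the only question is which fan header is which. A small filter, sketched here with a hypothetical function name list_fans, keeps just the fan labels and their RPM values. It reads sensors output on stdin and assumes the "fanN: <rpm> RPM" line layout shown above.

```shell
#!/bin/sh
# Print "<label> <rpm>" for every fan line found on stdin.
list_fans() {
    awk '/^fan[0-9]+:/ { sub(":", "", $1); print $1, $2 }'
}
```

Running sensors | list_fans repeatedly while gently slowing one fan by hand is a quick way to see which label belongs to which physical fan.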

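Run pwmconfig to detect the fans; it briefly stops each fan to work out which PWM output controls it.
It then generates the /etc/fancontrol configuration file; for the syntax, see fancontrol(8) - Linux man page.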

$ ssh alloy cat /etc/fancontrol
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=hwmon0=devices/virtual/thermal/thermal_zone0 hwmon3=devices/platform/nct6775.656
DEVNAME=hwmon0=acpitz hwmon3=nct6793
FCTEMPS=hwmon3/pwm3=hwmon0/temp1_input
FCFANS= hwmon3/pwm3=hwmon3/fan3_input
MINTEMP=hwmon3/pwm3=60
MAXTEMP=hwmon3/pwm3=85
MINSTART=hwmon3/pwm3=150
MINSTOP=hwmon3/pwm3=100
MINPWM=hwmon3/pwm3=63
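To double-check by hand that a PWM channel really drives the intended fan, the duty cycle can be written to sysfs directly. The sketch below uses a hypothetical helper set_pwm. The value 1 for pwmN_enable means manual control in the nct6775 driver, and other drivers may use different values. The hwmon directory is passed in as an argument because hwmon numbering can change between boots.

```shell
#!/bin/sh
# Write a duty cycle (0-255) to one PWM channel of a hwmon device.
# Arguments: hwmon directory, channel number, duty cycle.
set_pwm() {
    hwmon_dir=$1 channel=$2 duty=$3
    case $duty in
        ''|*[!0-9]*) echo "duty must be an integer" >&2; return 1 ;;
    esac
    if [ "$duty" -gt 255 ]; then
        duty=255                      # clamp to the valid range
    fi
    # Switch the channel to manual mode, then set the duty cycle.
    echo 1 > "$hwmon_dir/pwm${channel}_enable" || return 1
    echo "$duty" > "$hwmon_dir/pwm${channel}"
}
```

With the paths from the config above this would be called as root as set_pwm /sys/class/hwmon/hwmon3 3 255, with the fancontrol service stopped so the two do not fight over the same channel.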

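My fan is FAN3, so do not copy the configuration above verbatim. The following diagram makes the parameters easier to understand: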

A graph might help you understand how the different values relate
to each other:

    PWM ^
    255 +
        |
        |
        |                             ,-------------- MAXPWM
        |                           ,'.
        |                         ,'  .
        |                       ,'    .
        |                     ,'      .
        |                   ,'        .
        |                 ,'          .
        |       MINSTOP .'            .
        |               |             .
        |               |             .
        |               |             .
 MINPWM |---------------'             .
        |               .             .
        |               .             .
        |               .             .
      0 +---------------+-------------+---------------->
                     MINTEMP       MAXTEMP            t (degree C)
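The curve in the diagram can be written as a few lines of integer arithmetic. This is a simplified sketch, not the fancontrol code itself: the function name pwm_for_temp is made up, the numbers come from the /etc/fancontrol shown earlier, MAXPWM is assumed to be fancontrol's default of 255 because that file does not set it, and the sketch works in whole degrees and ignores the MINSTART start-up handling.

```shell
#!/bin/sh
# Values from the generated config; MAXPWM=255 is the assumed default.
MINTEMP=60 MAXTEMP=85 MINSTOP=100 MINPWM=63 MAXPWM=255

# Map a temperature in whole degrees C to a PWM duty cycle (0-255).
pwm_for_temp() {
    if [ "$1" -le "$MINTEMP" ]; then
        echo "$MINPWM"                # flat section left of MINTEMP
    elif [ "$1" -ge "$MAXTEMP" ]; then
        echo "$MAXPWM"                # flat section right of MAXTEMP
    else
        # Linear ramp from MINSTOP at MINTEMP up to MAXPWM at MAXTEMP.
        echo $(( ($1 - MINTEMP) * (MAXPWM - MINSTOP) / (MAXTEMP - MINTEMP) + MINSTOP ))
    fi
}
```

With these values, pwm_for_temp 48 prints 63, pwm_for_temp 72 prints 174, and anything at or above 85 prints 255. The jump from 63 to roughly 100 just above 60 C is the vertical step at MINTEMP in the diagram.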

Hack

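The last step is to tie the GPU core temperature to the fan speed. I hacked the fancontrol script itself; follow along if you need the same thing!
The change is around lines 36-39, between the hardcode GPU temp comments.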

# /usr/sbin/fancontrol
function UpdateFanSpeeds
{
        local fcvcount
        local pwmo tsens fan mint maxt minsa minso minpwm maxpwm
        local tval tlastval pwmpval fanval min_fanval one_fan one_fanval
        local -i pwmval

        let fcvcount=0
        while (( $fcvcount < ${#AFCPWM[@]} )) # go through all pwm outputs
        do
                #hopefully shorter vars will improve readability:
                pwmo=${AFCPWM[$fcvcount]}
                tsens=${AFCTEMP[$fcvcount]}
                fan=${AFCFAN[$fcvcount]}
                let mint="${AFCMINTEMP[$fcvcount]}*1000"
                let maxt="${AFCMAXTEMP[$fcvcount]}*1000"
                minsa=${AFCMINSTART[$fcvcount]}
                minso=${AFCMINSTOP[$fcvcount]}
                minpwm=${AFCMINPWM[$fcvcount]}
                maxpwm=${AFCMAXPWM[$fcvcount]}
                avg=${AFCAVERAGE[$fcvcount]}

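                #read tlastval < ${tsens}
                # hardcode GPU temp start
                # nvidia-smi reports whole degrees C
                tlastval=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)
                # hardcode GPU temp end
                if [ $? -ne 0 ] # exit status of nvidia-smi itself
                then
                        echo "Error reading GPU temperature from nvidia-smi"
                        restorefans 1
                fi
                # fancontrol works in millidegrees
                let tlastval="${tlastval}*1000"
                # (rest of UpdateFanSpeeds unchanged)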

When done, restart the service for the change to take effect:

sudo service fancontrol restart

References

How to see the Video Card Temperature (Nvidia, ATI, Intel...) - Ask Ubuntu
NVIDIA Management Library (NVML) | NVIDIA Developer
usr/sbin/pwmconfig: There are no pwm-capable sensor modules installed MSI | ubuntu 16.04 fancontrol - Stack Overflow
Cippo95/nvidia-fan-control: Controlling fans on my NVIDIA graphics card
fancontrol(8) - Linux man page
lm-sensors/doc/fancontrol.txt at master · lm-sensors/lm-sensors