Memory OOM Mechanism


Overview of the OOM Mechanism

When the system hits OOM, the panic_on_oom setting decides whether the kernel panics or kills a process:
panic_on_oom=0: kill a process; which process dies is decided by oom_kill_allocating_task:
    oom_kill_allocating_task=0: scan every process, score each one with the badness heuristic, and kill the highest scorer; the oom_score_adj knob can be used to adjust a process's oom_score and steer the choice by hand.
        Legacy knob (obsolete): the file is /proc/<pid>/oom_adj, range [-17, 15]; the higher the value, the more likely the process is to be killed by the OOM killer. oom_adj=-17 means the process is never killed by the OOM killer.
        Current knob: the file is /proc/<pid>/oom_score_adj, range [-1000, 1000]; the higher the value, the more likely the process is to be killed by the OOM killer. oom_score_adj=-1000 completely exempts the process from OOM killing.
    oom_kill_allocating_task non-zero: kill the task that triggered the OOM directly.
panic_on_oom=1: see below
panic_on_oom=2: the system panics
When OOM happens, oom_dump_tasks controls whether a per-task dump is written to the kernel log.
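All of these knobs live under /proc. As a quick orientation (the paths are the standard ones; <pid> is a placeholder for a real process ID), they can be inspected like this:

# cat /proc/sys/vm/panic_on_oom                 # 0, 1 or 2
# cat /proc/sys/vm/oom_kill_allocating_task     # 0 or non-zero
# cat /proc/sys/vm/oom_dump_tasks               # 0 or 1
# cat /proc/<pid>/oom_score                     # current badness score of a process
# cat /proc/<pid>/oom_score_adj                 # per-process adjustment, -1000..1000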

Kernel Parameters

panic_on_oom

When Linux runs out of memory, the panic_on_oom setting enables or disables the kernel panic path. The kernel's sysctl documentation describes it as follows:

This enables or disables panic on out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process, called oom_killer. Usually, oom_killer can kill rogue processes and system will survive.

If this is set to 1, the kernel panics when out-of-memory happens. However, if a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer. No panic occurs in this case. Because other nodes' memory may be free. This means system total status may be not fatal yet.

If this is set to 2, the kernel panics compulsorily even on the above-mentioned. Even oom happens under memory cgroup, the whole system panics.

The default value is 0. 1 and 2 are for failover of clustering. Please select either according to your policy of failover.

panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot.

The default value of panic_on_oom is 0: when OOM occurs, a process is killed so the system does not crash.
If panic_on_oom is set to 1, the kernel panics on OOM;
however, if the process that triggered the OOM is restricted to certain nodes via mempolicy/cpusets and only those nodes are exhausted, the kernel does not panic; it falls back to the oom_killer and kills a process, because other nodes may still have free memory and the system as a whole is not yet in a fatal state.
If panic_on_oom is set to 2, the kernel panics even in the restricted case above (and even for an OOM inside a memory cgroup).
Since the default is 0, the kernel kills a process on OOM to keep the system alive. How does it pick that process? Does it kill one at random? See the second parameter, oom_kill_allocating_task.
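To check or change the setting, the usual sysctl interface applies; a minimal example (the same values are used in the test section further below):

# sysctl vm.panic_on_oom
vm.panic_on_oom = 0
# sysctl -w vm.panic_on_oom=2                   # panic unconditionally on OOM
# echo 0 > /proc/sys/vm/panic_on_oom            # equivalent write via /proc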

oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire tasklist and select a task based on heuristics to kill. This normally selects a rogue memory-hogging task that frees up a large amount of memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that triggered the out-of-memory condition. This avoids the expensive tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value is used in oom_kill_allocating_task.

The default value is 0.

The oom_kill_allocating_task setting determines which process the oom_killer kills when the system goes OOM.

With the default value 0, every process is scanned and scored by the badness heuristic, and the process with the highest oom_score is killed (typically one that has not been running long but occupies a large amount of memory);
With a non-zero value, the task that triggered the OOM is killed directly, skipping the expensive tasklist scan;
If panic_on_oom is set to a non-zero value, it takes precedence and the oom_kill_allocating_task setting has no effect.
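A minimal way to flip this behaviour for an experiment (the second test case below relies on the default value 0):

# sysctl vm.oom_kill_allocating_task
vm.oom_kill_allocating_task = 0
# sysctl -w vm.oom_kill_allocating_task=1       # kill the triggering task directly, skip the scan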

oom_score

The /proc/<pid>/oom_score file shows a process's OOM badness score. On older kernels the score could be tuned through /proc/<pid>/oom_adj; that interface was deprecated (around 2012) and replaced by oom_score_adj.

"Taming the OOM killer": https://lwn.net/Articles/317814/


The process to be killed in an out-of-memory situation is selected based on its badness score. The badness score is reflected in /proc/<pid>/oom_score. This value is determined on the basis that the system loses the minimum amount of work done, recovers a large amount of memory, doesn't kill any innocent process eating tons of memory, and kills the minimum number of processes (if possible limited to one). The badness score is computed using the original memory size of the process, its CPU time (utime + stime), the run time (uptime - start time) and its oom_adj value. The more memory the process uses, the higher the score. The longer a process is alive in the system, the smaller the score.

/proc/<pid>/oom_score is a dynamic value which changes with time, and is not flexible with different and dynamic policies required by the administrator. It is difficult to determine which process will be killed in case of an OOM condition. The administrator must adjust the score for every process created, and for every process which exits. 
Important processes, such as database processes and their controllers, can be added to this group, so they are ignored when OOM-killer searches for processes to be killed. All children of the processes listed in tasks automatically are added to the same control group and inherit the oom.priority of the parent. When multiple tasks have the highest oom.priority, the OOM killer selects the process based on the oom_score and oom_adj.

How the oom_score is computed (figure omitted):

In principle: kill as few processes as possible (ideally just one) while recovering a large amount of memory and losing as little completed work as possible, and do not kill an innocent process merely because it uses a lot of memory.
The badness score is computed using the original memory size of the process, its CPU time (utime + stime), the run time (uptime - start time) and its oom_adj value.
The more memory a process uses, the higher its score; the longer it has been alive, the lower its score.
Conclusion: the oom_killer prefers to kill processes that started recently and occupy a large amount of memory.
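A quick, illustrative way to put these factors side by side on a live system (PID is a placeholder for a real process ID; the exact heuristic differs between kernel versions, so treat this as an eyeball check rather than a reproduction of the formula):

# PID=1234                                      # placeholder: pick a real PID
# cat /proc/$PID/oom_score                      # current badness score
# grep VmRSS /proc/$PID/status                  # resident memory
# ps -o etimes=,comm= -p $PID                   # elapsed run time and command name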

oom_adj

The knob administrators used on older kernels to influence a process's oom_score; the file is /proc/<pid>/oom_adj. The range is [-17, 15]; the higher the value, the more likely the process is to be killed by the OOM killer. Setting oom_adj to -17 means the process is never killed by the OOM killer.
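On kernels that still expose the legacy file for compatibility, a write to oom_adj is translated into an equivalent oom_score_adj value. An illustrative check, with <pid> as a placeholder (the exact mapping depends on the kernel version):

# echo -17 > /proc/<pid>/oom_adj
# cat /proc/<pid>/oom_score_adj                 # expected to read -1000 (OOM killing disabled)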

oom_score_adj

oom_score_adj replaces oom_adj and is the knob administrators use today to influence a process's oom_score; the file is /proc/<pid>/oom_score_adj. The range is [-1000, 1000]; the higher the value, the more likely the process is to be killed by the OOM killer. oom_score_adj=-1000 completely exempts the process from OOM killing.
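For example, to exempt a critical daemon from the OOM killer (PID is a placeholder; services started by systemd can achieve the same with the OOMScoreAdjust= unit option):

# echo -1000 > /proc/PID/oom_score_adj
# cat /proc/PID/oom_score                       # typically reads 0 once the process is exempt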

oom_dump_tasks

This option enables the per-task dump when the system hits OOM; the default is 1 (enabled).

  • oom_dump_tasks=0: do not dump the task list when OOM occurs;
  • oom_dump_tasks=1: default; dump the task list (the "Tasks state" lines in the OOM log) when OOM occurs;
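Checking and toggling it follows the same sysctl pattern as the other knobs:

# sysctl vm.oom_dump_tasks
vm.oom_dump_tasks = 1
# sysctl -w vm.oom_dump_tasks=0                 # suppress the per-task dump (useful on systems with very many tasks)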

Test Method

Test Approach

Create three processes, trigger an OOM, and use the order in which the processes are killed to demonstrate the OOM behaviour.

Test Cases

  1. A recently started, memory-hungry process is killed first
Steps
1. Check whether the allocating task is killed directly on OOM
# cat /proc/sys/vm/oom_kill_allocating_task
The value should be 1; if it is 0, run:
# echo 1 > /proc/sys/vm/oom_kill_allocating_task
2. Write the test program
# vim test.c
#include <stdlib.h>

int main() {
    int *ptr;
    while (1) {
        ptr = malloc(1024 * 1024); // allocate 1 MiB per iteration, never freed
        if (ptr == NULL) {         // stop once allocation fails
            break;
        }
    }
    return 0;
}
# gcc test.c -o test
3. In another terminal, run # tail -f /var/log/messages | grep memory
Open one more terminal and run top.
In the original terminal, run ./test

Expected
1. The value is 1
2. Compilation succeeds
3. The program keeps allocating memory until allocation fails, simulating an OOM. When the OOM hits, memory usage can be seen rising sharply, and the process is then killed by the kernel.
# ./test
Killed
The log terminal prints: Out of memory (oom_kill_allocating_task): Killed process 6633 (oomtest)
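If /var/log/messages does not exist on the test machine (for example on journald-only systems), the same OOM record can be read from the kernel ring buffer or the journal instead:

# dmesg | grep -i "out of memory"
# journalctl -k | grep -i "killed process"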

2. Lower-priority processes (higher oom_score_adj) are killed first

Steps
1. Open two terminals; run iotop in terminal 1 and htop in terminal 2 as the observed processes, and leave them running.
2. Get the PIDs of iotop and htop
# ps -ef | grep top
3. Give iotop the lowest survival priority so that it is killed first when OOM triggers
# echo 1000 > /proc/PID/oom_score_adj
Give htop the second-lowest priority so that it is killed second
# echo 999 > /proc/PID/oom_score_adj
4. Create core.sh; this script lists the processes with the highest oom_score (the ones most likely to be killed by the OOM killer) on the current system
#!/bin/bash
for proc in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
    printf "%2d %5d %s\n" \
        "$(cat $proc/oom_score)" \
        "$(basename $proc)" \
        "$(cat $proc/cmdline | tr '\0' ' ' | head -c 50)"
done 2>/dev/null | sort -nr | head -n 10
Run: # sh core.sh
5. Make sure oom_kill_allocating_task and panic_on_oom are both 0
Check: # sysctl -a | grep oom
vm.oom_kill_allocating_task = 0
vm.panic_on_oom = 0
If they need to be changed, run:
# sysctl -w vm.panic_on_oom=0
# sysctl -w vm.oom_kill_allocating_task=0
6. Write a memory-eating program that keeps allocating memory until allocation fails, to simulate an OOM.
# vim oomtest.c
#include <stdlib.h>

int main() {
    int *ptr;
    while (1) {
        ptr = malloc(1024 * 1024); // allocate 1 MiB per iteration, never freed
        if (ptr == NULL) {         // stop once allocation fails
            break;
        }
    }
    return 0;
}
Compile: # gcc oomtest.c -o oomtest
7. Open another terminal to watch the memory usage of the memory eater: # top | grep oomtest
Open one more terminal to capture the OOM log: # tail -f /var/log/messages | grep memory
Run the memory eater built above: # ./oomtest

Expected
1. Succeeds
2. Succeeds
3. Succeeds
4. Running # sh core.sh shows iotop in first place and htop in second
# ./core.sh
1334 232867 /usr/bin/python3 -s /usr/sbin/iotop
1332 234316 htop
669   886 /usr/bin/python3 -s /usr/sbin/firewalld --nofork -
668   912 /usr/bin/python3 -Es /usr/sbin/tuned -l -P
667 230264 sshd: root [priv]
667 230256 sshd: root [priv]
667 229038 sshd: root [priv]
667 228735 sshd: root [priv]
667 228690 sshd: root [priv]
667 228646 sshd: root [priv]

5. # sysctl -a | grep oom
vm.oom_kill_allocating_task = 0
vm.panic_on_oom = 0
6. Compilation succeeds
7. (1) The memory usage of the oomtest process first rises rapidly, then drops, and eventually top stops printing it
# top |grep oomtest
 242516 root      20   0  122.7g 502208    804 R   8.7  18.6   0:00.26 oomtest
 242516 root      20   0  618.7g   1.1g    264 D  56.5  42.6   0:01.96 oomtest
 242516 root      20   0  845.4g 694172     76 D  33.1  25.7   0:02.96 oomtest

(2) Both the iotop and htop processes have been killed
(3) The oomtest process prints Killed
# ./oomtest
Killed
(4) The log contains the "Out of memory: Killed" message, showing that the OOM mechanism was triggered.
The log also shows the oom_score_adj of each victim; the kill order goes from high oom_score_adj to low, confirming that processes are reclaimed according to the configured priority.
# tail -f /var/log/messages |grep memory
Jun 21 17:46:55 localhost kernel: [263075.517935]  out_of_memory+0xec/0x370
Jun 21 17:46:55 localhost kernel: [263075.518030] Tasks state (memory values in pages):
Jun 21 17:46:55 localhost kernel: [263075.518156] Out of memory: Killed process 232867 (iotop) total-vm:33972kB, anon-rss:2796kB, file-rss:12kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:1000
Jun 21 17:46:55 localhost kernel: [263075.541906]  out_of_memory+0xec/0x370
Jun 21 17:46:55 localhost kernel: [263075.542002] Tasks state (memory values in pages):
Jun 21 17:46:55 localhost kernel: [263075.542129] Out of memory: Killed process 234316 (htop) total-vm:23064kB, anon-rss:752kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:64kB oom_score_adj:999
Jun 21 17:46:55 localhost kernel: [263075.717151]  out_of_memory+0xec/0x370
Jun 21 17:46:55 localhost kernel: [263075.717246] Tasks state (memory values in pages):
Jun 21 17:46:55 localhost kernel: [263075.717380] Out of memory: Killed process 240735 (oomtest) total-vm:934221864kB, anon-rss:633368kB, file-rss:132kB, shmem-rss:0kB, UID:0 pgtables:1828256kB oom_score_adj:0