Troubleshooting a Flink Metaspace OOM

Published: 2024-01-09 11:18:53, author: 粒子先生

Error log

org.apache.flink.runtime.rest.handler.RestHandlerException: Could not execute application.
    at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Metaspace
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
    ... 7 more
Caused by: java.lang.OutOfMemoryError: Metaspace

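Investigation:

  • Rebuild the Flink image and start the JobManager process manually so that its PID is not 1 (jmap cannot collect heap information from a process running as PID 1).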

./bin/jobmanager.sh start
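  • Edit flink-conf.yaml and set the Metaspace size to 512 MB.
  • Reproduce the problem: upload the SQL job jar and run it repeatedly until the error appears.
  • Collect heap information with jmap: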

     
jmap -histo:live 2083 | head -n 100
jmap -dump:format=b,file=heap.bin 2083
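  • Analyze the output: org.apache.hadoop.hive.conf.HiveConf$ConfVars is loaded again on every SQL execution, which suggests that the Hive-related jars cause the problem.
  • Verify: add the required Hive, Kafka and other jars to the Flink runtime environment, then repackage the SQL job with the corresponding Hive and Kafka dependencies set to provided scope.
  • Upload the jar and test again: the Hive classes are no longer loaded repeatedly and the SQL job can be run more times before failing, so the problem is mitigated.
  • Investigate Flink's class loading: Flink offers two class resolution strategies, child-first (the default) and parent-first. parent-first loads a class from the Flink cluster's classpath first and falls back to the user jar only if the class is not found there. child-first loads from the user jar first. The benefit is a uniform Flink environment in which jars built against different dependency versions can all run, which resolves jar conflicts. The side effect is that classes bundled in the user jar are loaded again on each execution, which indirectly led to the Metaspace overflow.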
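
For reference, the two settings touched in the steps above could look roughly like this in flink-conf.yaml. This is only a sketch: the key names come from Flink's configuration reference rather than from the original notes, and it assumes that the JobManager is the process whose Metaspace was raised and that the Flink version in use supports these keys.

```yaml
# flink-conf.yaml (sketch; key names are assumptions, not taken from the notes)
# Metaspace limit of the JobManager JVM, raised to 512 MB while reproducing.
jobmanager.memory.jvm-metaspace.size: 512m
# Class resolution order for user code: child-first (default) or parent-first.
classloader.resolve-order: child-first
```

Raising the limit only delays the error while classes keep being reloaded; it was used here to make the problem easier to observe.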

Heap histogram (jmap -histo output)

19: 12078 579744 java.util.HashMap
20: 7406 554344 [Ljava.lang.String;
21: 294 550752 [Ljava.nio.ByteBuffer;
22: 30691 491056 java.lang.Object
23: 14786 473152 java.util.Hashtable$Entry
24: 5525 442000 java.lang.reflect.Constructor
25: 10055 402200 java.lang.ref.SoftReference
26: 141 334992 [Ljava.nio.channels.SelectionKey;
27: 7799 311960 java.util.WeakHashMap$Entry
28: 155 265360 [Lorg.apache.flink.shaded.netty4.io.netty.buffer.PoolSubpage;
29: 4707 263592 java.lang.Package
30: 9516 228384 java.util.ArrayList
31: 5628 225120 java.lang.ref.Finalizer
32: 1989 224560 [Ljava.util.Hashtable$Entry;
33: 980 207560 [Z
34: 5760 184320 org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
35: 2746 175744 org.apache.flink.shaded.netty4.io.netty.buffer.PoolSubpage
36: 3119 174664 java.util.LinkedHashMap
37: 5010 160320 java.lang.ref.WeakReference
38: 2802 156912 java.lang.invoke.MemberName
39: 4256 136192 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue$Link
40: 4217 134944 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue
41: 32 131584 [Lakka.actor.ActorRef;
42: 1050 127472 [Ljava.util.WeakHashMap$Entry;
43: 1732 124704 sun.reflect.DelegatingClassLoader
44: 3643 116576 java.util.Vector
45: 2614 104560 java.lang.invoke.MethodType
46: 6353 101648 java.lang.Integer
47: 4217 101208 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue$Head
48: 259 97384 java.lang.Thread
49: 3001 96032 sun.reflect.UnsafeObjectFieldAccessorImpl
50: 3964 95136 akka.dispatch.AbstractNodeQueue$Node
51: 2911 93152 java.util.concurrent.FutureTask
52: 1924 92352 java.util.Hashtable
53: 162 92016 org.apache.flink.shaded.netty4.io.netty.util.internal.shaded.org.jctools.queues.MpscUnboundedArrayQueue
54: 2728 87296 java.lang.ThreadLocal$ThreadLocalMap$Entry
55: 2624 83968 java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry
56: 1740 83520 org.apache.flink.runtime.checkpoint.MinMaxAvgStats
57: 1026 82880 [J
58: 2519 80608 org.apache.flink.shaded.netty4.io.netty.util.Recycler$DefaultHandle
59: 1935 77400 java.security.ProtectionDomain
60: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
61: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
62: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
63: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
64: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
65: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
66: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
67: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
68: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
69: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
70: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
71: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
72: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
73: 1143 73152 java.nio.DirectByteBuffer
74: 2938 70512 akka.actor.LightArrayRevolverScheduler$TaskHolder
75: 2911 69864 akka.actor.Scheduler$$anon$3
76: 2911 69864 org.apache.flink.runtime.rpc.messages.LocalFencedMessage
77: 2911 69864 org.apache.flink.runtime.rpc.messages.RunAsync
78: 1057 67648 java.util.concurrent.ConcurrentHashMap
79: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
80: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
81: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
82: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
83: 2038 65216 java.lang.invoke.DirectMethodHandle
84: 2590 62160 java.util.regex.Pattern$5
85: 3883 62128 java.util.HashSet
86: 1898 60736 java.security.CodeSource
87: 941 60224 java.net.URL
88: 1483 59320 java.util.TreeMap$Entry
89: 3265 52240 sun.reflect.DelegatingMethodAccessorImpl
90: 1044 50112 java.util.WeakHashMap
91: 783 50112 org.apache.flink.runtime.checkpoint.SubtaskStateStats
92: 466 48416 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
93: 1950 46800 sun.reflect.NativeConstructorAccessorImpl
94: 1883 45192 sun.reflect.generics.tree.SimpleClassTypeSignature
95: 768 43008 sun.nio.cs.UTF_8$Encoder
96: 1720 41280 sun.reflect.NativeMethodAccessorImpl
97: 572 41184 java.util.regex.Pattern
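
Repeated rows like the HiveConf$ConfVars ones are easier to spot when they are counted. A minimal sketch, assuming a POSIX shell with awk and sort; `dup_classes` is a hypothetical helper name, not part of jmap. It relies on the histogram printing a separate row each time the same class name has been loaded by a different class loader.

```shell
#!/bin/sh
# Print class names that occur on more than one row of `jmap -histo` output,
# most frequent first. Reads the histogram from stdin.
dup_classes() {
    # Data rows look like "60: 1022 73584 org.apache...HiveConf$ConfVars":
    # field 1 is the rank followed by ":", field 4 is the class name.
    awk '$1 ~ /^[0-9]+:$/ && NF >= 4 { count[$4]++ }
         END { for (c in count) if (count[c] > 1) print count[c], c }' |
        sort -k1,1nr -k2,2
}
```

Usage: `jmap -histo:live 2083 | dup_classes`. On the excerpt above it would report 13 rows for org.apache.hadoop.hive.conf.HiveConf$ConfVars and 4 for the fastjson IdentityHashMap$Entry array class.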

Solution

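Adjust the Flink runtime environment to include the commonly used Hive, Kafka, Elasticsearch and related jars, then repackage the program with those already-included dependencies declared with provided scope.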
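
One way to check that the repackaging worked is to count the entries under the Hive package in the job jar. A minimal sketch, assuming a POSIX shell with awk; `count_bundled` and the jar name `sql-job.jar` are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Count entries in a jar listing (one path per line on stdin, as printed by
# `jar tf`) whose path starts with the given package directory.
count_bundled() {
    # index(...) == 1 is a literal prefix match, so no regex escaping is needed.
    awk -v p="$1" 'index($0, p) == 1 { n++ } END { print n + 0 }'
}
```

Usage: `jar tf sql-job.jar | count_bundled org/apache/hadoop/hive/`. A result of 0 means no Hive classes are bundled in the jar, so they can only come from the jars provided by the Flink environment.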