Error log
org.apache.flink.runtime.rest.handler.RestHandlerException: Could not execute application.
	at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Metaspace
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
	... 7 more
Caused by: java.lang.OutOfMemoryError: Metaspace
Investigation:
- Rebuilt the Flink image and started the JobManager process by hand so that its PID is not 1 (jmap cannot attach to a process with PID 1):
./bin/jobmanager.sh start
- Edited flink-conf.yaml and lowered the Metaspace size to 512 MB so the leak would trigger an OOM sooner
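For reference, the change corresponds to flink-conf.yaml entries along these lines (key names as documented for Flink 1.11+; on older versions the Metaspace cap had to be passed as a JVM flag via env.java.opts instead):

```yaml
# flink-conf.yaml — shrink Metaspace so the class-loading leak
# hits the limit after fewer job submissions
jobmanager.memory.jvm-metaspace.size: 512m
taskmanager.memory.jvm-metaspace.size: 512m
```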
- Reproduced the problem: uploaded the SQL job jar and submitted it repeatedly until the OOM occurred
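The repeated submissions can be scripted against Flink's REST API (a sketch; the address and jar name are placeholders for this environment):

```shell
# Reproduction sketch: upload the SQL job jar once via the Flink REST API,
# then re-run it repeatedly until Metaspace fills up.
flink=http://localhost:8081

# /jars/upload returns JSON like {"filename":"/tmp/flink-web/<id>.jar",...};
# extract the trailing jar id with sed
jar_id_from_response() {
  sed 's/.*"filename":"[^"]*\/\([^"]*\)".*/\1/'
}

rerun_jar() {  # usage: rerun_jar <times>
  jar_id=$(curl -s -X POST -H "Expect:" -F "jarfile=@sql-job.jar" \
      "$flink/jars/upload" | jar_id_from_response)
  for i in $(seq 1 "$1"); do
    curl -s -X POST "$flink/jars/$jar_id/run" > /dev/null
  done
}
```

Calling `rerun_jar 50` keeps resubmitting the same jar; each run loads a fresh set of user classes under child-first classloading.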
- Captured a class histogram and a heap dump with jmap:
jmap -histo:live 2083 | head -n 100
jmap -dump:format=b,file=heap.bin 2083
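To spot duplicated classes in the histogram automatically, a small helper can be piped onto the jmap output (a sketch; the awk field positions match the `num: #instances #bytes class` layout of `jmap -histo`):

```shell
# dup_classes: read a "jmap -histo" listing on stdin and print class names
# that appear more than once, i.e. classes loaded by several classloaders —
# the pattern that leaks Metaspace under child-first classloading.
dup_classes() {
  awk 'NF >= 4 && $1 ~ /^[0-9]+:$/ {print $4}' \
    | sort | uniq -c | awk '$1 > 1 {print $1, $2}' | sort -rn
}

# typical use: jmap -histo:live <pid> | dup_classes
```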
- Analyzed the dump and found that org.apache.hadoop.hive.conf.HiveConf$ConfVars was loaded again on every SQL submission, pointing to the Hive dependencies as the likely cause
- Verification: the Flink runtime environment already bundles the jars needed for Hive, Kafka, etc., so the SQL job was repackaged with the corresponding Hive and Kafka dependencies set to provided scope
- Uploaded the repackaged jar and verified again: the Hive classes were no longer loaded repeatedly, and the SQL job could be submitted noticeably more times before failing, so the problem was mitigated
- Investigated Flink's classloading mechanism. Flink offers two resolution orders: child-first (the default) and parent-first. With parent-first, classes are resolved from the Flink cluster's classpath first, falling back to the user jar only when a class is not found there. With child-first, the user jar is searched first. The benefit of child-first is that a single Flink installation can run jobs built against different dependency versions, avoiding jar conflicts; the downside is that every submission loads its own copies of the bundled classes, which is exactly what exhausted Metaspace here.
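The resolution order is configurable in flink-conf.yaml; two options documented by Flink apply here (the Hive package prefix below is illustrative):

```yaml
# flink-conf.yaml — resolve all classes from the cluster classpath first;
# safe only if job dependencies match the versions shipped in lib/
classloader.resolve-order: parent-first

# alternatively, keep child-first but force specific packages to be
# resolved parent-first (appended to Flink's built-in default patterns)
classloader.parent-first-patterns.additional: org.apache.hadoop.hive.
```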
Heap histogram (jmap output)
19: 12078 579744 java.util.HashMap
20: 7406 554344 [Ljava.lang.String;
21: 294 550752 [Ljava.nio.ByteBuffer;
22: 30691 491056 java.lang.Object
23: 14786 473152 java.util.Hashtable$Entry
24: 5525 442000 java.lang.reflect.Constructor
25: 10055 402200 java.lang.ref.SoftReference
26: 141 334992 [Ljava.nio.channels.SelectionKey;
27: 7799 311960 java.util.WeakHashMap$Entry
28: 155 265360 [Lorg.apache.flink.shaded.netty4.io.netty.buffer.PoolSubpage;
29: 4707 263592 java.lang.Package
30: 9516 228384 java.util.ArrayList
31: 5628 225120 java.lang.ref.Finalizer
32: 1989 224560 [Ljava.util.Hashtable$Entry;
33: 980 207560 [Z
34: 5760 184320 org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
35: 2746 175744 org.apache.flink.shaded.netty4.io.netty.buffer.PoolSubpage
36: 3119 174664 java.util.LinkedHashMap
37: 5010 160320 java.lang.ref.WeakReference
38: 2802 156912 java.lang.invoke.MemberName
39: 4256 136192 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue$Link
40: 4217 134944 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue
41: 32 131584 [Lakka.actor.ActorRef;
42: 1050 127472 [Ljava.util.WeakHashMap$Entry;
43: 1732 124704 sun.reflect.DelegatingClassLoader
44: 3643 116576 java.util.Vector
45: 2614 104560 java.lang.invoke.MethodType
46: 6353 101648 java.lang.Integer
47: 4217 101208 org.apache.flink.shaded.netty4.io.netty.util.Recycler$WeakOrderQueue$Head
48: 259 97384 java.lang.Thread
49: 3001 96032 sun.reflect.UnsafeObjectFieldAccessorImpl
50: 3964 95136 akka.dispatch.AbstractNodeQueue$Node
51: 2911 93152 java.util.concurrent.FutureTask
52: 1924 92352 java.util.Hashtable
53: 162 92016 org.apache.flink.shaded.netty4.io.netty.util.internal.shaded.org.jctools.queues.MpscUnboundedArrayQueue
54: 2728 87296 java.lang.ThreadLocal$ThreadLocalMap$Entry
55: 2624 83968 java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry
56: 1740 83520 org.apache.flink.runtime.checkpoint.MinMaxAvgStats
57: 1026 82880 [J
58: 2519 80608 org.apache.flink.shaded.netty4.io.netty.util.Recycler$DefaultHandle
59: 1935 77400 java.security.ProtectionDomain
60: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
61: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
62: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
63: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
64: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
65: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
66: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
67: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
68: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
69: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
70: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
71: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
72: 1022 73584 org.apache.hadoop.hive.conf.HiveConf$ConfVars
73: 1143 73152 java.nio.DirectByteBuffer
74: 2938 70512 akka.actor.LightArrayRevolverScheduler$TaskHolder
75: 2911 69864 akka.actor.Scheduler$$anon$3
76: 2911 69864 org.apache.flink.runtime.rpc.messages.LocalFencedMessage
77: 2911 69864 org.apache.flink.runtime.rpc.messages.RunAsync
78: 1057 67648 java.util.concurrent.ConcurrentHashMap
79: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
80: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
81: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
82: 4 65728 [Lcom.alibaba.fastjson.util.IdentityHashMap$Entry;
83: 2038 65216 java.lang.invoke.DirectMethodHandle
84: 2590 62160 java.util.regex.Pattern$5
85: 3883 62128 java.util.HashSet
86: 1898 60736 java.security.CodeSource
87: 941 60224 java.net.URL
88: 1483 59320 java.util.TreeMap$Entry
89: 3265 52240 sun.reflect.DelegatingMethodAccessorImpl
90: 1044 50112 java.util.WeakHashMap
91: 783 50112 org.apache.flink.runtime.checkpoint.SubtaskStateStats
92: 466 48416 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
93: 1950 46800 sun.reflect.NativeConstructorAccessorImpl
94: 1883 45192 sun.reflect.generics.tree.SimpleClassTypeSignature
95: 768 43008 sun.nio.cs.UTF_8$Encoder
96: 1720 41280 sun.reflect.NativeMethodAccessorImpl
97: 572 41184 java.util.regex.Pattern
Solution
Adjust the Flink runtime environment to bundle the commonly needed Hive, Kafka, Elasticsearch, etc. jars, then repackage the job and mark those already-bundled dependencies as provided.
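In a Maven build, "marked as provided" means giving each cluster-supplied dependency provided scope so it is available at compile time but stays out of the fat jar (artifact id and version placeholder below are illustrative; use the coordinates that match the cluster):

```xml
<!-- pom.xml — one entry per dependency already present in Flink's lib/ -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-hive_2.11</artifactId>
  <version>${flink.version}</version>
  <scope>provided</scope>
</dependency>
```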