Troubleshooting a Collective Crash of Doris BE Nodes

Published 2023-07-18 12:01:59 | Author: 不要学我说话

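Background

A new version went live on July 14. On July 16, the BE nodes of the Doris cluster went down one after another within a short period, and restarting them resolved the problem temporarily. On Monday, July 17, when work resumed, the BE nodes began crashing repeatedly, which disrupted normal use.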

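Locating the problem:

1. Check the Doris BE node logs

The be.out log is shown below. Frame 7 of the stack trace (doris::PlanFragmentExecutor) shows that the crash was triggered by SQL execution, so the next step is to use a core dump to identify the query that brought the BE down.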

*** Aborted at 1689488662 (unix time) try "date -d @1689488662" if you are using GNU date ***
*** SIGSEGV unkown detail explain (@0x0) received by PID 44257 (TID 0x7fb793b90700) from PID 0; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/common/signal_handler.h:420
1# 0x00007FB7CC97C400 in /lib64/libc.so.6
2# doris::vectorized::IAggregateFunctionHelper<doris::vectorized::AggregateFunctionCountNotNullUnary>::add_batch(unsigned long, char**, unsigned long, doris::vectorized::IColumn const**, doris::vectorized::Arena*) const at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/aggregate_functions/aggregate_function.h:151
3# doris::vectorized::AggFnEvaluator::execute_batch_add(doris::vectorized::Block*, unsigned long, char**, doris::vectorized::Arena*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exprs/vectorized_agg_fn.cpp:131
4# doris::vectorized::AggregationNode::_execute_with_serialized_key(doris::vectorized::Block*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exec/vaggregation_node.cpp:864
5# std::_Function_handler<doris::Status (doris::vectorized::Block*), std::_Bind_result<doris::Status, doris::Status (doris::vectorized::AggregationNode::*(doris::vectorized::AggregationNode*, std::_Placeholder<1>))(doris::vectorized::Block*)> >::_M_invoke(std::_Any_data const&, doris::vectorized::Block*&&) at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/std_function.h:293
6# doris::vectorized::AggregationNode::open(doris::RuntimeState*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exec/vaggregation_node.cpp:375
7# doris::PlanFragmentExecutor::open_vectorized_internal() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:286
8# doris::PlanFragmentExecutor::open() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:259
9# doris::FragmentExecState::execute() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:248
10# doris::FragmentMgr::_exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:481
11# std::_Function_handler<void (), std::_Bind_result<void, void (doris::FragmentMgr::*(doris::FragmentMgr*, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>))(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>)> >::_M_invoke(std::_Any_data const&) at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/std_function.h:291
12# doris::ThreadPool::dispatch_thread() at /mnt/disk2/ygl/code/github/apache-doris/be/src/util/threadpool.cpp:578
13# doris::Thread::supervise_thread(void*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/util/thread.cpp:407
14# start_thread in /lib64/libpthread.so.0
15# clone in /lib64/libc.so.6
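When several BE nodes crash, it can help to pull the crash time and the query-execution frames out of each be.out automatically. The following is a minimal sketch, assuming the header and frame format shown in the log above; the function names are illustrative and not part of Doris.

```python
import re
from datetime import datetime, timezone

# Header line written by the BE signal handler, as seen in be.out above.
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(unix time\)")
# Stack frames in be.out look like "7# symbol at file:line".
FRAME_RE = re.compile(r"^\s*(\d+)#\s+(.*)$")


def crash_time_utc(be_out_text):
    """Return the crash time from the 'Aborted at' header as a UTC datetime."""
    match = ABORT_RE.search(be_out_text)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def frames_with_prefix(be_out_text, prefix):
    """Return the numbers of frames whose symbol starts with the given prefix."""
    numbers = []
    for line in be_out_text.splitlines():
        match = FRAME_RE.match(line)
        if match and match.group(2).startswith(prefix):
            numbers.append(int(match.group(1)))
    return numbers


# A few lines copied from the be.out excerpt above.
SAMPLE = """*** Aborted at 1689488662 (unix time) try "date -d @1689488662" if you are using GNU date ***
6# doris::vectorized::AggregationNode::open(doris::RuntimeState*) at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/exec/vaggregation_node.cpp:375
7# doris::PlanFragmentExecutor::open_vectorized_internal() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:286
8# doris::PlanFragmentExecutor::open() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/plan_fragment_executor.cpp:259
9# doris::FragmentExecState::execute() at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:248
"""

if __name__ == "__main__":
    print(crash_time_utc(SAMPLE).isoformat())
    print(frames_with_prefix(SAMPLE, "doris::PlanFragmentExecutor::"))
```

Matching on the prefix "doris::PlanFragmentExecutor::" rather than on a substring avoids false hits on frames that merely take a PlanFragmentExecutor pointer as an argument, such as frames 10 and 11 in the full trace.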

2. How to generate a core dump

  • Check whether core dump generation is enabled by running ulimit -a

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1544256
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                     (-n) 655350
pipe size           (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority             (-r) 0
stack size             (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes             (-u) 655350
virtual memory         (kbytes, -v) unlimited
file locks                     (-x) unlimited

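The first line shows that core file size on this server is unlimited (if it is 0, no core dump is generated). The limit can be changed with the commands below, or by adding ulimit -c unlimited -n 65536 to the BE startup script.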

ulimit -c 1024 # limit the core dump file size to 1024 KB
ulimit -c unlimited # do not limit the core dump file size
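The same limit can also be checked from a script, for example as part of a pre-start health check. This is a minimal sketch using Python's standard resource module (Unix only); it reports the limit of the process that runs it, which is inherited from the launching shell, so it only reflects the BE's limit if it is started from the same environment as the BE. The function names are illustrative.

```python
import resource


def describe_limit(value):
    """Render an rlimit value; RLIM_INFINITY means no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def core_limit_status():
    """Report the soft and hard core file size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        "soft": describe_limit(soft),
        "hard": describe_limit(hard),
        # A soft limit of 0 means the kernel will not write a core file.
        "core_dumps_enabled": soft != 0,
    }


if __name__ == "__main__":
    print(core_limit_status())
```

For a BE process that is already running, the effective limit can be read from the "Max core file size" row of /proc/<pid>/limits on Linux.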
  • Check the path of the core dump file

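By default, the core dump file is named core and is written to the directory from which the BE startup script was run, and a newly generated core dump overwrites the previous one. If /proc/sys/kernel/core_uses_pid contains 1, the file is instead named core.<pid>. (It is recommended to ask the system administrator to turn this switch on.) If no core dump file is found in the directory from which the BE startup script was run, the system administrator may have changed core_pattern. Run cat /proc/sys/kernel/core_pattern to see where core files are written.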
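To make the naming rules concrete, the sketch below predicts the core file name from the values of core_pattern and core_uses_pid. It handles only a subset of the specifiers described in the Linux core(5) manual page (%p, %e and %%) and treats a pattern that starts with | as "piped to a helper program". The directory /data/cores in the usage example is hypothetical; the PID and binary name are taken from the logs and commands in this post.

```python
def expected_core_name(pattern, uses_pid, pid, exe):
    """Predict the core file name for a core_pattern value.

    Returns None when the pattern pipes the dump to a helper program,
    in which case the helper decides where the dump is stored.
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return None

    parts = []
    has_pid = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            spec = pattern[i + 1]
            if spec == "p":
                parts.append(str(pid))
                has_pid = True
            elif spec == "e":
                parts.append(exe)
            elif spec == "%":
                parts.append("%")
            else:
                # Specifiers outside this sketch are left untouched.
                parts.append("%" + spec)
            i += 2
        else:
            parts.append(ch)
            i += 1

    name = "".join(parts)
    # With core_uses_pid set and no %p in the pattern, ".PID" is appended.
    if uses_pid and not has_pid:
        name += f".{pid}"
    return name


if __name__ == "__main__":
    print(expected_core_name("core", True, 44257, "palo_be"))
    print(expected_core_name("/data/cores/core.%e.%p", True, 44257, "palo_be"))
```

A relative result such as core.44257 is created in the crashing process's working directory, which is why the dump normally appears where the BE startup script was run.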

3. Use the core dump to locate the problematic SQL

gdb ../lib/palo_be core.xxxx
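  • Obtain the QueryID from the stack frame index

    After the previous step, enter the bt command to print the backtrace and look for doris::PlanFragmentExecutor (keep pressing Enter to page through the output). In the output here it appears at frame 449.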

    #0  add (row_num=0, columns=0x556361d9a9b8, place=0x556361cb7018 "", this=0x55638449a790)
      at /mnt/disk2/ygl/code/github/apache-doris/be/src/vec/common/pod_array.h:342
      ····················
    #447 0x0000561787fe19d8 in doris::ScanNode::prepare(doris::RuntimeState*) ()
      at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/scan_node.cpp:30
    #448 0x00005617880bedea in doris::OdbcScanNode::prepare(doris::RuntimeState*) ()
      at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/odbc_scan_node.cpp:57
    #449 0x0000561787990495 in doris::PlanFragmentExecutor::prepare(doris::TExecPlanFragmentParams const&, doris::QueryFragmentsCtx*) ()
      at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h:421
    #450 0x000056178790106d in doris::FragmentExecState::prepare (this=this@entry=0x5617b76a2000, params=...)
      at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:227
    #451 0x0000561787906b87 in doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&, std::function<void (doris::PlanFragmentExecutor*)>) () at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:646
    #452 0x0000561787908bd0 in doris::FragmentMgr::exec_plan_fragment(doris::TExecPlanFragmentParams const&) ()
      at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/tuple:746
    #453 0x0000561787a00cb6 in doris::PInternalServiceImpl<doris::PBackendService>::_exec_plan_fragment (this=0x5617906fe4e0,
       ser_request=..., version=<optimized out>, compact=<optimized out>)
      at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/exec_env.h:150
    ---Type <return> to continue, or q <return> to quit---q
    Quit
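Paging through hundreds of frames by hand is tedious. As an alternative, gdb can be run non-interactively with its -batch and -ex options and the backtrace searched by a script. The sketch below only builds the command line and parses backtrace text; the helper names are illustrative, and the sample text is copied from the backtrace above.

```python
import re

# gdb frames look like "#449 0x... in symbol (...) ()" or "#0  symbol (...)".
GDB_FRAME_RE = re.compile(r"^\s*#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")


def gdb_batch_command(binary, core, commands):
    """Build an argv list that runs the given gdb commands and then exits."""
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]
    return argv + [binary, core]


def gdb_frames_with_prefix(bt_text, prefix):
    """Return the frame numbers whose symbol starts with the given prefix."""
    numbers = []
    for line in bt_text.splitlines():
        match = GDB_FRAME_RE.match(line)
        if match and match.group(2).startswith(prefix):
            numbers.append(int(match.group(1)))
    return numbers


SAMPLE_BT = """#447 0x0000561787fe19d8 in doris::ScanNode::prepare(doris::RuntimeState*) ()
  at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/scan_node.cpp:30
#448 0x00005617880bedea in doris::OdbcScanNode::prepare(doris::RuntimeState*) ()
  at /mnt/disk2/ygl/code/github/apache-doris/be/src/exec/odbc_scan_node.cpp:57
#449 0x0000561787990495 in doris::PlanFragmentExecutor::prepare(doris::TExecPlanFragmentParams const&, doris::QueryFragmentsCtx*) ()
  at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h:421
#450 0x000056178790106d in doris::FragmentExecState::prepare (this=this@entry=0x5617b76a2000, params=...)
  at /mnt/disk2/ygl/code/github/apache-doris/be/src/runtime/fragment_mgr.cpp:227
"""

if __name__ == "__main__":
    print(gdb_batch_command("../lib/palo_be", "core.xxxx", ["bt"]))
    print(gdb_frames_with_prefix(SAMPLE_BT, "doris::PlanFragmentExecutor::"))
```

The argv list can be passed to subprocess.run(argv, capture_output=True, text=True) and the captured stdout handed to gdb_frames_with_prefix; core.xxxx stands for the actual core file name, as in the gdb command above.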

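    Enter q and press Enter to leave the pager, then enter f 449 to switch to frame 449 and p _query_id to print the query ID (the hi value is sufficient). Finally, enter p /x followed by the hi value to convert it to hexadecimal.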

    (gdb) f 449
    #449 0x0000561787990495 in doris::PlanFragmentExecutor::prepare(doris::TExecPlanFragmentParams const&, doris::QueryFragmentsCtx*) ()
      at /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h:421
    421 /mnt/disk2/ygl/installs/ldbtools/include/c++/11/bits/unique_ptr.h: No such file or directory.
    (gdb) p _query_id
    $1 = {<apache::thrift::TBase> = {_vptr.TBase = 0x56178ca192a0 <vtable for doris::TUniqueId+48>}, hi = -2521141818464581758,
    lo = -7080784611811882727}
    (gdb) p /x -2521141818464581758
    $2 = 0xdd031adfaa094782
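The conversion that p /x performs is a reinterpretation of the signed 64-bit value as unsigned. The sketch below does the same for both halves and joins them in the hi-lo form that appears in the QueryId field of the audit log. That the two halves are printed without zero padding is an assumption; it makes no difference for the values in this post.

```python
MASK64 = (1 << 64) - 1


def to_hex64(value):
    """Reinterpret a signed 64-bit integer as unsigned and format it as hex."""
    return format(value & MASK64, "x")


def query_id(hi, lo):
    """Join the hi and lo halves of a TUniqueId as 'hi-lo' in hexadecimal."""
    return f"{to_hex64(hi)}-{to_hex64(lo)}"


if __name__ == "__main__":
    # hi and lo as printed by gdb above.
    print(query_id(-2521141818464581758, -7080784611811882727))
    # prints dd031adfaa094782-9dbbffe941e68919
```

Converting lo as well gives the full ID, which narrows the search further if the hi half alone matches more than one audit log entry.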

    Next, search the fe.audit.log on every FE for this query_id (grep the fe.audit.log file for the corresponding date), as shown below. The Stmt field reveals the problematic SQL.

[root@localhost log]# grep dd031adfaa094782 fe.audit.log.20230717-1
2023-07-17 14:08:06,389 [query] |Client=10.196.166.3:34996|User=root|Db=default_cluster:ssom_doris|State=ERR|Time=3377|ScanBytes=0|ScanRows=0|ReturnRows=0|StmtId=14121202|QueryId=dd031adfaa094782-9dbbffe941e68919|IsQuery=true|feIp=10.196.166.4|Stmt=SELECT   columns FROM table_name   WHERE  del_flag='0' AND ((condition1 = '113.108.173.100' AND condition2 = 3602959022916898816) OR (condition1 = '61.147.93.7' AND condition2 = null) OR (condition1 = '120.197.38.18' AND condition2 = null) OR (condition1 = '61.147.96.60' AND condition2 = null) OR (condition1 = '119.34.177.100' AND condition2 = null) OR (condition1 = '14.125.55.70' AND condition2 = null)....)|CpuTimeMS=0|SqlHash=e332b6574b085aa6a57425e79cbf4104|peakMemoryBytes=0
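When there are several FEs and several log files per day, the grep step can be scripted. The following is a minimal sketch, assuming the |Key=Value| layout of the audit line above, with Stmt immediately followed by CpuTimeMS. It works line by line, so a statement that spans multiple lines in the log would not be captured in full. The sample line reuses the fields of the entry above with the Stmt shortened for brevity.

```python
import re

QUERY_ID_RE = re.compile(r"\|QueryId=([0-9a-f]+-[0-9a-f]+)\|")
# Assumes the field that follows Stmt is CpuTimeMS, as in the line above.
STMT_RE = re.compile(r"\|Stmt=(.*?)\|CpuTimeMS=")


def find_statements(lines, hi_hex):
    """Return (query_id, stmt) pairs whose QueryId starts with hi_hex."""
    results = []
    for line in lines:
        id_match = QUERY_ID_RE.search(line)
        if id_match is None or not id_match.group(1).startswith(hi_hex + "-"):
            continue
        stmt_match = STMT_RE.search(line)
        stmt = stmt_match.group(1) if stmt_match else None
        results.append((id_match.group(1), stmt))
    return results


# Same fields as the audit entry above; the Stmt is shortened for brevity.
SAMPLE_LINE = (
    "2023-07-17 14:08:06,389 [query] |Client=10.196.166.3:34996|User=root"
    "|Db=default_cluster:ssom_doris|State=ERR|Time=3377|ScanBytes=0|ScanRows=0"
    "|ReturnRows=0|StmtId=14121202|QueryId=dd031adfaa094782-9dbbffe941e68919"
    "|IsQuery=true|feIp=10.196.166.4"
    "|Stmt=SELECT columns FROM table_name WHERE del_flag='0'"
    "|CpuTimeMS=0|SqlHash=e332b6574b085aa6a57425e79cbf4104|peakMemoryBytes=0"
)

if __name__ == "__main__":
    for qid, stmt in find_statements([SAMPLE_LINE], "dd031adfaa094782"):
        print(qid, stmt)
```

To scan real files, pass an open file object as lines, for example find_statements(open("fe.audit.log.20230717-1", encoding="utf-8"), "dd031adfaa094782"), and repeat for each FE's log directory.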

References: https://www.jianshu.com/p/60a5df15093c

https://wizardforcel.gitbooks.io/100-gdb-tips/content/display-instruction-pc.html