How to Analyze a PostgreSQL Crash Dump File

Published: 2024-01-10 21:26:09 Author: jl1771

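1. Introduction

In this blog post, I will discuss how to enable the generation of crash dump files (also known as core dumps) and some commonly used GDB commands that help developers troubleshoot crash-related issues in PostgreSQL and other applications. Proper analysis of a problem usually takes time and some knowledge of the application's source code. In my experience, it is sometimes better to look at the bigger picture rather than only at the point of the crash.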

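2. What Is a Crash Dump File?

A crash dump file is a file containing the recorded state of an application's working memory at the moment it crashed. This state is represented by stacks of memory addresses and CPU registers. Debugging with memory addresses and CPU registers alone is usually very hard, because they tell you nothing about the application logic. Consider the core dump contents below, which show a back trace of memory addresses leading to the point of the crash.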

#1  0x00687a3d in ?? ()
#2  0x00d37f06 in ?? ()
#3  0x00bf0ba4 in ?? ()
#4  0x00d3333b in ?? ()
#5  0x00d3f682 in ?? ()
#6  0x00d3407b in ?? ()
#7  0x00d3f2f7 in ?? ()

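Not very useful, is it? When we see a crash dump that looks like this, it means the application was not built with debugging symbols, which makes the dump practically useless. If that is the case, you need to install the debug version of the application or rebuild the application with debugging enabled.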
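Before reproducing a crash, it can save time to check whether the binary carries debugging information at all. The sketch below is one possible check, not part of the original post: it assumes binutils' readelf is installed, and the function name has_debug_info is made up for illustration. It looks for the DWARF .debug_info section in the ELF section list.

```shell
# Hypothetical helper: succeeds (exit 0) if the ELF file contains a
# .debug_info section, which is where DWARF debugging data is stored.
# Assumes readelf (binutils) is available on the system.
has_debug_info() {
    readelf -S "$1" 2>/dev/null | grep -q '\.debug_info'
}

# Example usage (the path is a placeholder, adjust to your installation):
# if has_debug_info /usr/local/pgsql/bin/postgres; then
#     echo "debug symbols present"
# else
#     echo "no debug symbols, backtraces will show ?? ()"
# fi
```

If the check fails, a core dump from that binary is likely to produce a back trace like the one shown above.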

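3. How to Generate a Useful Crash Dump File

Before generating a crash dump file, we need to make sure the application is built with debugging symbols. This can be done by running ./configure like this:

./configure --enable-debug

This adds the -g parameter to CFLAGS in the main makefile, src/Makefile.global, so that debugging symbols are included. By default the optimization level there is set to 2 (-O2). I also tend to change it to 0 (-O0), so that stepping through the stack in GDB follows the source code instead of jumping around, and so that most variables in memory can be printed instead of showing up as <optimized out>.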

CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -g -O0
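As an alternative to editing the generated makefile by hand, the optimization level can usually be set when running configure. The following is a minimal sketch under that assumption: the function name configure_debug_build is hypothetical, and it relies on PostgreSQL's configure script accepting a CFLAGS variable on its command line, with --enable-debug still adding -g.

```shell
# Hypothetical helper: configure a PostgreSQL source tree for debugging.
# Assumption: a user-supplied CFLAGS replaces the default -O2, and
# --enable-debug appends -g to it.
configure_debug_build() {
    src_dir="$1"
    if [ ! -x "$src_dir/configure" ]; then
        echo "no configure script found in $src_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$src_dir" && ./configure --enable-debug CFLAGS="-O0")
}

# Example usage (placeholder path):
# configure_debug_build "$HOME/src/postgres" && make -C "$HOME/src/postgres"
```

After configuring this way, the CFLAGS line in src/Makefile.global should end with -O0 and -g, which is worth verifying before building.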

Now we can enable crash dump generation. This can be done with the ulimit command.

ulimit -c unlimited

To disable it:

ulimit -c 0

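Make sure there is enough disk space, because crash dump files are usually very large: they record the working memory and execution state of the process at the moment of the crash. Also make sure ulimit is set in the same shell before PostgreSQL is started from it. When PostgreSQL crashes, a core dump file named core will be generated in $PGDATA (the exact name and location can differ if the system's kernel.core_pattern setting has been changed).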
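A quick way to confirm the settings before starting the server is to print them from the same shell. This is a small sketch with made-up function names (core_dumps_enabled, show_core_settings); the /proc/sys/kernel/core_pattern file is Linux-specific, so the sketch only reads it when it exists.

```shell
# Hypothetical helper: succeeds if the soft core size limit is not 0.
core_dumps_enabled() {
    # "ulimit -c" prints the current soft limit; "0" means core dumps are off.
    [ "$(ulimit -c)" != "0" ]
}

# Hypothetical helper: print the settings that decide whether and where
# a core file is written.
show_core_settings() {
    echo "core size limit: $(ulimit -c)"
    if [ -r /proc/sys/kernel/core_pattern ]; then
        # Linux only: the pattern controls the core file name and location.
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

# Example usage:
# ulimit -c unlimited
# core_dumps_enabled && show_core_settings
```

If core_pattern starts with a pipe character, the dump is handed to another program instead of being written as a plain file, so the core file may not appear in $PGDATA.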

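4. Analyzing the Dump File with GDB

GDB (the GNU Debugger) is a portable debugger that runs on many Unix-like systems and works with many programming languages. It is my favorite tool for analyzing crash dump files. To demonstrate, I will deliberately add a line to the PostgreSQL source code that causes a segmentation fault when the CREATE TABLE command is run.

Suppose PostgreSQL has crashed and generated a core dump file named core at ~/highgo/git/postgres/postgresdb/core. I would first use the file utility to learn more about the core file, such as kernel information and the program that generated it.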

caryh@HGPC01:~$ file /home/caryh/highgo/git/postgres/postgresdb/core
postgresdb/core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'postgres: cary cary [local] CREATE TABLE', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: '/home/caryh/highgo/git/postgres/highgo/bin/postgres', platform: 'x86_64'
caryh@HGPC01:~$

The file utility tells me that the core file was generated by the application /home/caryh/highgo/git/postgres/highgo/bin/postgres, so I will run gdb like this:

gdb /home/caryh/highgo/git/postgres/highgo/bin/postgres -c  /home/caryh/highgo/git/postgres/postgresdb/core

GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/caryh/highgo/git/postgres/highgo/bin/postgres...done.
[New LWP 27417]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: cary cary [local] CREATE TABLE                                 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  heap_insert (relation=relation@entry=0x7f872f532228, tup=tup@entry=0x55ba8290f778, cid=0, options=options@entry=0,
    bistate=bistate@entry=0x0) at heapam.c:1840
1840            ereport(LOG,(errmsg("heap tuple len = %d", heaptup->t_len)));
(gdb)

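Right after gdb is run on the core file, it shows the location of the crash, heapam.c:1840, which is exactly the line I deliberately added to cause the crash.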
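When the back trace only needs to be captured, for example to attach it to a bug report, gdb can also be run without an interactive session. The sketch below is an illustration rather than part of the original workflow: the function name print_backtrace is hypothetical, and it uses gdb's -batch and -ex options to run a single bt command and exit.

```shell
# Hypothetical helper: print the back trace of a core file without
# entering the interactive gdb prompt.
print_backtrace() {
    binary="$1"
    core="$2"
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: print_backtrace <binary> <core-file>" >&2
        return 1
    fi
    # -batch exits after the commands have run; -ex supplies one command.
    gdb -batch -ex "bt" "$binary" "$core"
}

# Example usage with the paths from this post:
# print_backtrace /home/caryh/highgo/git/postgres/highgo/bin/postgres \
#                 /home/caryh/highgo/git/postgres/postgresdb/core > backtrace.txt
```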

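5. Useful GDB Commands

With gdb it is easy to identify the location of a crash, because it tells you right after you run gdb on the core file. Unfortunately, in my experience the crash location is not the real cause of the problem roughly 95% of the time. This is why I mentioned earlier that it is sometimes better to look at the bigger picture than at the point of the crash. The crash may be caused by a bug in the application logic that occurred somewhere before the application actually crashed. Even if you fix the crash itself, the logic bug is still there, and most likely the application will crash somewhere else later or produce unsatisfactory results. It is therefore worth knowing some powerful GDB commands that help us understand the call stack better and find the real root cause.

5.1 bt (Back Trace) Command

The bt command shows the chain of function calls (stack frames) from the start of the application all the way to the crash. With debugging fully enabled, you can see the arguments and values passed to each function call, as well as the source file and line number of each call. This lets developers trace backwards and check earlier processing for potential application logic errors.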

(gdb) bt
#0  heap_insert (relation=relation@entry=0x7f872f532228, tup=tup@entry=0x55ba8290f778, cid=0, options=options@entry=0,
    bistate=bistate@entry=0x0) at heapam.c:1840
#1  0x000055ba81ccde3e in simple_heap_insert (relation=relation@entry=0x7f872f532228, tup=tup@entry=0x55ba8290f778)
    at heapam.c:2356
#2  0x000055ba81d7826d in CatalogTupleInsert (heapRel=0x7f872f532228, tup=0x55ba8290f778) at indexing.c:228
#3  0x000055ba81d946ea in TypeCreate (newTypeOid=newTypeOid@entry=0, typeName=typeName@entry=0x7ffcf56ef820 "test",
    typeNamespace=typeNamespace@entry=2200, relationOid=relationOid@entry=16392, relationKind=relationKind@entry=114 'r',
    ownerId=ownerId@entry=16385, internalSize=-1, typeType=99 'c', typeCategory=67 'C', typePreferred=false,
    typDelim=44 ',', inputProcedure=2290, outputProcedure=2291, receiveProcedure=2402, sendProcedure=2403,
    typmodinProcedure=0, typmodoutProcedure=0, analyzeProcedure=0, elementType=0, isImplicitArray=false, arrayType=16393,
    baseType=0, defaultTypeValue=0x0, defaultTypeBin=0x0, passedByValue=false, alignment=100 'd', storage=120 'x',
    typeMod=-1, typNDims=0, typeNotNull=false, typeCollation=0) at pg_type.c:484
#4  0x000055ba81d710bc in AddNewRelationType (new_array_type=16393, new_row_type=<optimized out>, ownerid=<optimized out>,
    new_rel_kind=<optimized out>, new_rel_oid=<optimized out>, typeNamespace=2200, typeName=0x7ffcf56ef820 "test")
    at heap.c:1033
#5  heap_create_with_catalog (relname=relname@entry=0x7ffcf56ef820 "test", relnamespace=relnamespace@entry=2200,
    reltablespace=reltablespace@entry=0, relid=16392, relid@entry=0, reltypeid=reltypeid@entry=0,
    reloftypeid=reloftypeid@entry=0, ownerid=16385, accessmtd=2, tupdesc=0x55ba8287c620, cooked_constraints=0x0,
    relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, oncommit=ONCOMMIT_NOOP,
    reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0, typaddress=0x0)
    at heap.c:1294
#6  0x000055ba81e3782a in DefineRelation (stmt=stmt@entry=0x55ba82876658, relkind=relkind@entry=114 'r', ownerId=16385,
    ownerId@entry=0, typaddress=typaddress@entry=0x0,
    queryString=queryString@entry=0x55ba82855648 "create table test (a int, b char(10)) using heap;") at tablecmds.c:885
#7  0x000055ba81fd5b2f in ProcessUtilitySlow (pstate=pstate@entry=0x55ba82876548, pstmt=pstmt@entry=0x55ba828565a0,
    queryString=queryString@entry=0x55ba82855648 "create table test (a int, b char(10)) using heap;",
    context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, qc=0x7ffcf56efe50,
    dest=0x55ba82856860) at utility.c:1161
#8  0x000055ba81fd4120 in standard_ProcessUtility (pstmt=0x55ba828565a0,
    queryString=0x55ba82855648 "create table test (a int, b char(10)) using heap;", context=PROCESS_UTILITY_TOPLEVEL,
    params=0x0, queryEnv=0x0, dest=0x55ba82856860, qc=0x7ffcf56efe50) at utility.c:1069
#9  0x000055ba81fd1962 in PortalRunUtility (portal=0x55ba828b7dd8, pstmt=0x55ba828565a0, isTopLevel=<optimized out>,
    setHoldSnapshot=<optimized out>, dest=<optimized out>, qc=0x7ffcf56efe50) at pquery.c:1157
#10 0x000055ba81fd23e3 in PortalRunMulti (portal=portal@entry=0x55ba828b7dd8, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55ba82856860, altdest=altdest@entry=0x55ba82856860,
    qc=qc@entry=0x7ffcf56efe50) at pquery.c:1310
#11 0x000055ba81fd2f51 in PortalRun (portal=portal@entry=0x55ba828b7dd8, count=count@entry=9223372036854775807,
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x55ba82856860,
    altdest=altdest@entry=0x55ba82856860, qc=0x7ffcf56efe50) at pquery.c:779
#12 0x000055ba81fce967 in exec_simple_query (query_string=0x55ba82855648 "create table test (a int, b char(10)) using heap;")
    at postgres.c:1239
#13 0x000055ba81fd0d7e in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55ba8287fdb0, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:4315
#14 0x000055ba81f4f52a in BackendRun (port=0x55ba82877110, port=0x55ba82877110) at postmaster.c:4536
#15 BackendStartup (port=0x55ba82877110) at postmaster.c:4220
#16 ServerLoop () at postmaster.c:1739
#17 0x000055ba81f5063f in PostmasterMain (argc=3, argv=0x55ba8284fee0) at postmaster.c:1412
#18 0x000055ba81c91c04 in main (argc=3, argv=0x55ba8284fee0) at main.c:210
(gdb)
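A full back trace like the one above can be hard to scan because of the long argument lists. As a reading aid, the sketch below reduces saved bt output to one line per frame with just the frame number and function name. The function name summarize_backtrace is made up for illustration, and the pattern assumes the frame line layout shown above ("#N  [address in] function (...").

```shell
# Hypothetical helper: read gdb "bt" output on stdin and print
# "<frame number> <function name>" for every frame line.
# Continuation lines and frames without symbols (?? ()) are skipped.
summarize_backtrace() {
    sed -n -E 's/^#([0-9]+) +(0x[0-9a-f]+ in )?([A-Za-z_][A-Za-z0-9_]*) .*/\1 \3/p'
}

# Example usage (placeholder file name):
# summarize_backtrace < backtrace.txt
```

For the trace in this post, the first lines of the summary would be "0 heap_insert", "1 simple_heap_insert" and "2 CatalogTupleInsert", which makes the path from CREATE TABLE down to the crash easier to follow.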

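5.2 f (frame) Command

The f command, followed by a frame number, makes gdb jump to a specific stack frame listed by the bt command and lets you print other variables within that frame. For example: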

(gdb) f 3
#3  0x000055ba81d946ea in TypeCreate (newTypeOid=newTypeOid@entry=0, typeName=typeName@entry=0x7ffcf56ef820 "test",
    typeNamespace=typeNamespace@entry=2200, relationOid=relationOid@entry=16392, relationKind=relationKind@entry=114 'r',
    ownerId=ownerId@entry=16385, internalSize=-1, typeType=99 'c', typeCategory=67 'C', typePreferred=false,
    typDelim=44 ',', inputProcedure=2290, outputProcedure=2291, receiveProcedure=2402, sendProcedure=2403,
    typmodinProcedure=0, typmodoutProcedure=0, analyzeProcedure=0, elementType=0, isImplicitArray=false, arrayType=16393,
    baseType=0, defaultTypeValue=0x0, defaultTypeBin=0x0, passedByValue=false, alignment=100 'd', storage=120 'x',
    typeMod=-1, typNDims=0, typeNotNull=false, typeCollation=0) at pg_type.c:484
484                     CatalogTupleInsert(pg_type_desc, tup);
(gdb)

This makes gdb jump to frame #3, at pg_type.c:484. From here you can examine all the other variables in this frame (inside the function TypeCreate).

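5.3 p (print) Command

This is the most commonly used command in gdb. It prints the value of a variable or expression, including the address held by a pointer.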

(gdb) p tup
$1 = (HeapTuple) 0x55ba8290f778
(gdb) p pg_type_desc
$2 = (Relation) 0x7f872f532228

(gdb)  p * tup
$3 = {t_len = 176, t_self = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, t_tableOid = 0,
  t_data = 0x55ba8290f790}

(gdb) p * pg_type_desc
$4 = {rd_node = {spcNode = 1663, dbNode = 16384, relNode = 1247}, rd_smgr = 0x55ba828e2a38, rd_refcnt = 2, rd_backend = -1,
  rd_islocaltemp = false, rd_isnailed = true, rd_isvalid = true, rd_indexvalid = true, rd_statvalid = false,
  rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_firstRelfilenodeSubid = 0, rd_droppedSubid = 0,
  rd_rel = 0x7f872f532438, rd_att = 0x7f872f532548, rd_id = 1247, rd_lockInfo = {lockRelId = {relId = 1247, dbId = 16384}},
  rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false,
  rd_partkey = 0x0, rd_partkeycxt = 0x0, rd_partdesc = 0x0, rd_pdcxt = 0x0, rd_partcheck = 0x0, rd_partcheckvalid = false,
  rd_partcheckcxt = 0x0, rd_indexlist = 0x7f872f477d00, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0,
  rd_indexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_pubactions = 0x0, rd_options = 0x0,
  rd_amhandler = 0, rd_tableam = 0x55ba82562c20 <heapam_methods>, rd_index = 0x0, rd_indextuple = 0x0, rd_indexcxt = 0x0,
  rd_indam = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0, rd_indoption = 0x0,
  rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_indcollation = 0x0,
  rd_opcoptions = 0x0, rd_amcache = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x55ba828d5cb0}
(gdb)

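Using *, you tell the p command to print the value a pointer points to; without it, p prints the address stored in the pointer.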
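The frame and print steps can also be combined into one non-interactive call, which is handy when the same expression has to be checked in several core files. This is only a sketch: inspect_in_frame is a hypothetical name, and it assumes the expression is valid in the selected frame (for example *tup in frame 3 of the trace above).

```shell
# Hypothetical helper: select a stack frame and print one expression in it.
inspect_in_frame() {
    binary="$1"
    core="$2"
    frame="$3"
    expr="$4"
    # Reject anything that is not a plain non-negative frame number.
    case "$frame" in
        ''|*[!0-9]*)
            echo "frame must be a non-negative integer" >&2
            return 2
            ;;
    esac
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "binary or core file is not readable" >&2
        return 1
    fi
    # Commands given with -ex run in order: select the frame, then print.
    gdb -batch -ex "frame $frame" -ex "p $expr" "$binary" "$core"
}

# Example usage with the paths from this post:
# inspect_in_frame /home/caryh/highgo/git/postgres/highgo/bin/postgres \
#                  /home/caryh/highgo/git/postgres/postgresdb/core 3 '*tup'
```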

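5.4 x (examine) Command

The x command examines the contents of a block of memory with a specified size and format. The example below examines the t_data value inside the HeapTuple structure. Note that we first print *tup to learn that the tuple length (t_len) is 176, and then use the x command to examine the first 176 bytes that t_data points to. In x/176bx, 176 is the number of units, b sets the unit size to one byte, and x selects hexadecimal output.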

(gdb)  p *tup
$6 = {t_len = 176, t_self = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, t_tableOid = 0,
  t_data = 0x55ba8290f790}

(gdb)  p tup->t_data
$7 = (HeapTupleHeader) 0x55ba8290f790
(gdb) x/176bx  tup->t_data
0x55ba8290f790: 0xc0    0x02    0x00    0x00    0xff    0xff    0xff    0xff
0x55ba8290f798: 0x47    0x00    0x00    0x00    0xff    0xff    0xff    0xff
0x55ba8290f7a0: 0x00    0x00    0x1f    0x00    0x01    0x00    0x20    0xff
0x55ba8290f7a8: 0xff    0xff    0x0f    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7b0: 0x0a    0x40    0x00    0x00    0x74    0x65    0x73    0x74
0x55ba8290f7b8: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7c0: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7c8: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7d0: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7d8: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7e0: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7e8: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f7f0: 0x00    0x00    0x00    0x00    0x98    0x08    0x00    0x00
0x55ba8290f7f8: 0x01    0x40    0x00    0x00    0xff    0xff    0x00    0x63
0x55ba8290f800: 0x43    0x00    0x01    0x2c    0x08    0x40    0x00    0x00
0x55ba8290f808: 0x00    0x00    0x00    0x00    0x09    0x40    0x00    0x00
0x55ba8290f810: 0xf2    0x08    0x00    0x00    0xf3    0x08    0x00    0x00
0x55ba8290f818: 0x62    0x09    0x00    0x00    0x63    0x09    0x00    0x00
0x55ba8290f820: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x55ba8290f828: 0x00    0x00    0x00    0x00    0x64    0x78    0x00    0x00
0x55ba8290f830: 0x00    0x00    0x00    0x00    0xff    0xff    0xff    0xff
0x55ba8290f838: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
(gdb)
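Raw bytes become more meaningful once they are mapped back to values you already know. As a small worked example, the four bytes 0x74 0x65 0x73 0x74 at address 0x55ba8290f7b4 in the dump above are the ASCII codes of "test", the type name seen earlier in the back trace. The helper below, whose name hex_bytes_to_ascii is made up for illustration, does that conversion using only printf.

```shell
# Hypothetical helper: turn hex byte values (as printed by x/bx) into text.
hex_bytes_to_ascii() {
    for byte in "$@"; do
        # The inner printf converts e.g. 0x74 to octal 164; the outer
        # printf then emits the character for the \164 escape.
        printf "\\$(printf '%03o' "$byte")"
    done
    printf '\n'
}

hex_bytes_to_ascii 0x74 0x65 0x73 0x74   # prints: test
```

Only printable bytes give readable output this way; for binary fields such as OIDs, the bytes have to be read as little-endian integers on x86-64 instead.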

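6. Summary

In this blog post, we discussed how to generate a useful crash dump file that contains enough debugging symbols to help developers troubleshoot crashes in PostgreSQL and other applications. We also discussed gdb, a very powerful and useful debugger, and shared some of its most commonly used commands for troubleshooting a crash from a core file. I hope the information here helps some developers get better at troubleshooting.

Reposted from: https://www.highgo.ca/2020/11/07/how-to-analyze-a-postgresql-crash-dump-file/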