Perf in Linux

发布时间 2023-11-06 13:51:48作者: DORAB

Perf Commands:

perf


root@f5c6cb9f45b1:/# perf 

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   daemon          Run record sessions on background
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   iostat          Show I/O performance metrics
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

perf stat -h

root@f5c6cb9f45b1:/# perf stat -h 

 Usage: perf stat [<options>] [<command>]

    -a, --all-cpus        system-wide collection from all CPUs
    -A, --no-aggr         disable CPU count aggregation
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -D, --delay <n>       ms to wait before starting measurement after program start (-1: start with events disabled)
    -d, --detailed        detailed run - start a lot of events
    -e, --event <event>   event selector. use 'perf list' to list available events
    -G, --cgroup <name>   monitor event in cgroup name only
    -g, --group           put the counters into a counter group
    -I, --interval-print <n>
                          print counts at regular interval in ms (overhead is possible for values <= 100ms)
    -i, --no-inherit      child tasks do not inherit counters
    -M, --metrics <metric/metric group list>
                          monitor specified metrics or metric groups (separated by ,)
    -n, --null            null run - dont start any counters
    -o, --output <file>   output file name
    -p, --pid <pid>       stat events on existing process id
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
    -S, --sync            call sync() before starting a run
    -t, --tid <tid>       stat events on existing thread id
    -T, --transaction     hardware transaction statistics
    -v, --verbose         be more verbose (show counter open errors, etc)
    -x, --field-separator <separator>
                          print counts with custom separator
        --all-kernel      Configure all used events to run in kernel space.
        --all-user        Configure all used events to run in user space.
        --append          append to the output file
        --control <fd:ctl-fd[,ack-fd] or fifo:ctl-fifo[,ack-fifo]>
                          Listen on ctl-fd descriptor for command to control measurement ('enable': enable events, 'disable': disable events).
                          Optionally send control command completion ('ack\n') to ack-fd descriptor.
                          Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.
        --filter <filter>
                          event filter
        --for-each-cgroup <name>
                          expand events for each cgroup
        --interval-clear  clear screen in between new interval
        --interval-count <n>
                          print counts for fixed number of times
        --iostat[=<default>]
                          measure I/O performance metrics provided by arch/platform
        --log-fd <n>      log output to fd, instead of stderr
        --metric-no-group
                          don't group metric events, impacts multiplexing
        --metric-no-merge
                          don't try to share events between metrics in a group
        --metric-only     Only print computed metrics. No raw values
        --no-csv-summary  don't print 'summary' for CSV summary output
        --no-merge        Do not merge identical named events
        --per-core        aggregate counts per physical processor core
        --per-die         aggregate counts per processor die
        --per-node        aggregate counts per numa node
        --per-socket      aggregate counts per processor socket
        --per-thread      aggregate counts per thread
        --percore-show-thread
                          Use with 'percore' event qualifier to show the event counts of one hardware thread by sum up total hardware threads of same physical core
        --post <command>  command to run after to the measured command
        --pre <command>   command to run prior to the measured command
        --quiet           don't print output (useful with record)
        --scale           Use --no-scale to disable counter scaling for multiplexing
        --smi-cost        measure SMI cost
        --summary         print summary for interval mode
        --table           display details about each run (only with -r option)
        --td-level <n>    Set the metrics level for the top-down statistics (0: max level)
        --timeout <n>     stop workload and print counts after a timeout period in ms (>= 10ms)
        --topdown         measure top-down statistics

Command Examples

  • The first step is collecting data. We need to collect sched_stat and sched_switch events. Sched_stat events are not enough, because they are generated in the context of a task, which wakes up a target task (e.g. releases a lock). We need the same event but with a call-chain of the target task. This call-chain can be extracted from a previous sched_switch event.
    • -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit: These are options that specify which events perf should record:
      • sched:sched_stat_sleep: This event is recorded when a task goes to sleep, or in other words, when it is not runnable, waiting for some resource or event.
      • sched:sched_switch: This records context switch information, which occurs when the kernel switches the CPU from running one task to another.
      • sched:sched_process_exit: This captures information about process termination/exiting.
    • -g: This option enables call graph (stack frame) recording. With call graphs, you can see the path of function calls that lead to the event, which is helpful for understanding the runtime behavior of the system and identifying performance bottlenecks.
    • -o ~/perf.data.raw: This directs perf to output the raw profiling data to a file named perf.data.raw in the user's home directory (~). By default, perf writes to a file named perf.data in the current directory.
    • ~/foo: This part of the command indicates the actual workload or command to be profiled is an executable named foo located in the user's home directory. The perf record command will run this executable as the workload to profile.
  • The second step is merging sched_start and sched_switch events. It can be done with help of "perf inject -s".
    • post-process a perf.data file that was previously recorded with perf record.
    • It allows additional data to be injected into an existing perf.data file or it can filter or transform trace data in various ways.
    • -v: This is the verbose flag. When you add -v to perf commands, it runs in verbose mode, giving more detailed output about what the command is doing.
    • -s: This option tells perf inject to build synthesized events such as mmap events to resolve symbolic names for addresses within the recorded data. It is particularly useful when the data file is to be analyzed on a different machine or at a different time when the mapped memory addresses may differ.
    • -i ~/perf.data.raw: Specifies the input file from which to read perf data. This would be the raw perf data file you've named perf.data.raw in your home directory.
    • -o ~/perf.data: Defines the output file to which the resulting perf data should be written. In this case, it's a file named perf.data
    • It's a way to enrich or transform your profiling data post-capture without having to recapture it, which can save time and effort, especially if you forgot to record some types of data initially.
  • The third step is used for generating a performance report from the data collected by perf record. Here is a breakdown of what each part of the command does:
    • --stdio: This option tells perf report to use standard I/O (input/output) for displaying the report. It is the textual output mode that prints the report to the terminal (or to a file if redirected). Without this, perf report might use a TUI (Text User Interface) which provides an interactive text-based graphical interface.
    • -i ~/perf.data: The -i flag specifies the input file for the perf report command, which in this case is ~/perf.data. It tells perf report to generate the report based on the perf.data file located in the home directory.
    • --show-total-period: This option includes the "total period" in the report output, which is essentially the total time (or count) that each function (or sample point) was recorded as running during the profiling. It helps you understand how long or how often a particular function was active in relation to the total time of the collected data.
$ ./perf record -e sched:sched_stat_sleep -e sched:sched_switch  -e sched:sched_process_exit -g -o ~/perf.data.raw ~/foo
$ ./perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
$ ./perf report --stdio --show-total-period -i ~/perf.data

Examples of the overhead calculation

The overhead can be shown in two columns as 'Children' and 'Self' when perf collects callchains.

  • The 'self' overhead is simply calculated by adding all period values of the entry - usually a function (symbol). This is the value that perf shows traditionally and sum of all the 'self' overhead values should be 100%.
  • The 'children' overhead is calculated by adding all period values of the child functions so that it can show the total overhead of the higher level functions even if they don't directly execute much. 'Children' here means functions that are called from another (parent) function.

It might be confusing that the sum of all the 'children' overhead values exceeds 100% since each of them is already an accumulation of 'self' overhead of its child functions. But with this enabled, users can find which function has the most overhead even if samples are spread over the children.

Consider the following example; there are three functions like below.

void foo(void) {
    /* do something */
}

void bar(void) {
    /* do something */
    foo();
}

int main(void) {
    bar()
    return 0;
}
  • In this case 'foo' is a child of 'bar', and 'bar' is an immediate child of 'main' so 'foo' also is a child of 'main'. In other words, 'main' is a parent of 'foo' and 'bar', and 'bar' is a parent of 'foo'.

  • Suppose all samples are recorded in 'foo' and 'bar' only. When it's recorded with callchains the output will show something like below in the usual (self-overhead-only) output of perf report:

    Overhead  Symbol
    ........  .....................
      60.00%  foo
    		  |
    		  --- foo
    			  bar
    			  main
    			  __libc_start_main
    
      40.00%  bar
    		  |
    		  --- bar
    			  main
    			  __libc_start_main
    
  • When the --children option is enabled, the 'self' overhead values of child functions (i.e. 'foo' and 'bar') are added to the parents to calculate the 'children' overhead. In this case the report could be displayed as:

    Children      Self  Symbol
    ........  ........  ....................
     100.00%     0.00%  __libc_start_main
    		  |
    		  --- __libc_start_main
    
     100.00%     0.00%  main
    		  |
    		  --- main
    			  __libc_start_main
    
     100.00%    40.00%  bar
    		  |
    		  --- bar
    			  main
    			  __libc_start_main
    
      60.00%    60.00%  foo
    		  |
    		  --- foo
    			  bar
    			  main
    			  __libc_start_main
    

In the above output, the 'self' overhead of 'foo' (60%) was add to the 'children' overhead of 'bar', 'main' and '__libc_start_main'. Likewise, the 'self' overhead of 'bar' (40%) was added to the 'children' overhead of 'main' and '__libc_start_main'.

So '__libc_start_main' and 'main' are shown first since they have same (100%) 'children' overhead (even though they have zero 'self' overhead) and they are the parents of 'foo' and 'bar'.

Since v3.16 the 'children' overhead is shown by default and the output is sorted by its values. The 'children' overhead is disabled by specifying --no-children option on the command line or by adding 'report.children = false' or 'top.children = false' in the perf config file.

Flame Graph

perf script -i hellotest.data | ~/Github/FlameGraph/stackcollapse-perf.pl > hellotest.perf-folded

~/Github/FlameGraph/flamegraph.pl perfhellotest.perf-folded > perfhellotest.svg