A Comparison of Three Linux Performance Analysis Tools

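https://blog.51cto.com/xiamachao/1857696

Whether you are designing CPUs, developing servers, or building storage systems, performance is a hard requirement that cannot be sidestepped. Often a system is functionally complete but its performance falls short, and the bottleneck has to be found and optimized. The first step is to understand how the system works internally, taking the hardware, the operating system, and the application into account: its data flow, its critical paths, and ideally its core modules, from which an abstract model can be built. The next step is to use profiling tools to find the hot paths in the relevant modules, together with their elapsed time and their share of the total. Linux ships with a number of flexible, purpose-built tools for this, and several vendors have open-sourced excellent profilers as well. Drawing on my recent analysis of I/O write performance on a server, this article shares how to use three mainstream tools and my impressions of them.

1. gprof: a handy profiler for ordinary applications

gprof produces an execution profile for C, Pascal, and Fortran77 programs. The profile records the call relationships inside the program and the time spent in each function. The program must be built with the -pg option. The option is required in the final link step, otherwise the executable does not write a profile when it runs; GCC also requires it when compiling each source file to be profiled, because that is what inserts the call-counting code. The default profile file name is gmon.out. Take the following test file:

    #include <stdio.h>
    #include <string.h>

    unsigned long f3(unsigned long cnt)
    {
        //return cnt*cnt*cnt*cnt*cnt*cnt*cnt;
        return cnt*cnt*cnt*cnt*cnt*cnt;
    }

    unsigned long f2(unsigned long cnt)
    {
        //return (f3(cnt) ^ f3(cnt + 1)) % cnt;
        return (f3(cnt) ^ f3(cnt + 1));
    }

    unsigned long f1(unsigned long cnt)
    {
        if (cnt == 0) return 0;
        if (cnt == 1) return 1;
        return (f2(cnt-1)^f2(cnt-2));
    }

    int main(void)
    {
        int i, cnt = 40000;
        for (i=0;i<cnt;i++)
            printf("F1(%d) = %lu\n", i, f1(i));  /* %lu: f1 returns unsigned long */
        return 0;
    }

The following commands build an executable that writes gmon.out:

    gcc -pg -c -o test.o test.c
    gcc -pg -o test test.o

After the program has run once, gmon.out exists, and the following command shows each function's share of the run time and its call count:

    [root@localhost backup]# gprof test gmon.out -p
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
     67.37      0.02     0.02   15×××   126.32   126.32  f3
     33.68      0.03     0.01    79996   126.32   378.97  f2
      0.00      0.03     0.00    40000     0.00   757.89  f1

     %         the percentage of the total running time of the
    time       program used by this function.

    cumulative a running sum of the number of seconds accounted
     seconds   for by this function and those listed above it.

     self      the number of seconds accounted for by this
    seconds    function alone.  This is the major sort for this
               listing.

    calls      the number of times this function was invoked, if
               this function is profiled, else blank.

     self      the average number of milliseconds spent in this
    ms/call    function per call, if this function is profiled,
               else blank.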
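     total     the average number of milliseconds spent in this
    ms/call    function and its descendents per call, if this
               function is profiled, else blank.

    name       the name of the function.  This is the minor sort
               for this listing. The index shows the location of
               the function in the gprof listing. If the index is
               in parenthesis it shows where it would appear in
               the gprof listing if it were to be printed.

For more detail, run gprof test gmon.out -q or gprof test gmon.out -l:

                         Call graph (explanation follows)

    granularity: each sample hit covers 2 byte(s) for 32.99% of 0.03 seconds

    index % time    self  children    called     name
                    0.01    0.02   79996/79996       f1 [2]
    [1]    100.0    0.01    0.02   79996         f2 [1]
                    0.02    0.00  15×××/15×××       f3 [4]
    -----------------------------------------------
                    0.00    0.03   40000/40000       main [3]
    [2]    100.0    0.00    0.03   40000         f1 [2]
                    0.01    0.02   79996/79996       f2 [1]
    -----------------------------------------------
                                                     <spontaneous>
    [3]    100.0    0.00    0.03                 main [3]
                    0.00    0.03   40000/40000       f1 [2]
    -----------------------------------------------
                    0.02    0.00  15×××/15×××       f2 [1]
    [4]     66.7    0.02    0.00  15×××         f3 [4]

As the output shows, gprof follows the call relationships between functions and readily reports each function's call count and share of the run time. Its one drawback is poor support for multithreading. To profile a multithreaded program, first build gprof-helper.c into libgprof-helper.so:

    gcc -shared -fPIC gprof-helper.c -o libgprof-helper.so -lpthread -ldl

Then copy libgprof-helper.so into /usr/lib/ and /usr/lib64/, and finally link it into the application:

    gcc -pg -o test test.o -lgprof-helper -lpthread

2. perf: an integrated profiler for the kernel and applications

Besides gprof, there is perf, a powerful profiler maintained in the Linux kernel source tree and widely used for both kernel and application performance analysis. Most Linux distributions now ship perf, and it is simple to use: before running the application, specify the event to sample and the collection mode. For example, to sample on CPU time and obtain each function's share of the run time:

    perf record -e cpu-clock --call-graph fp ./test

This produces perf.data. Running perf report -v -i perf.data then shows the call chain from libc down into the kernel functions, with execution-time statistics:

[Figure: perf report output (wKiom1fsmeOAZpa_AAG3bhEEwhM924.png)]

The figure shows that printf() accounts for 3.8% of the total execution time and that most of the rest is spent in the kernel.

3. stap (SystemTap): the professional's Swiss Army knife

SystemTap translates a script written in the stap language into C code, then either builds that code into a kernel module or injects it into a running application through dynamic instrumentation. Like perf, it collects the desired statistics without any change to the source code. In addition, it can print the values of arguments and variables in the kernel or in an application while the program runs. In the kernel, for example, we can write a script for stap, the SystemTap front end, to print a parameter of the iSCSI driver:

    [root@localhost perf]# cat d2.stp
    probe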
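    module("iscsi_target_mod").function("iscsi_target_locate_portal") {
        printf("iSCIS initiator login: %s\n", kernel_string($login->req_buf));
    }

Run the stap script as follows:

    [root@localhost perf]# stap d2.stp -v
    Pass 1: parsed user script and 117 library script(s) using 223476virt/40972res/3064shr/38364data kb, in 170usr/20sys/212real ms.
    Pass 2: analyzed script: 1 probe(s), 2 function(s), 0 embed(s), 0 global(s) using 225668virt/44044res/3936shr/40556data kb, in 50usr/220sys/282real ms.
    Pass 3: translated to C into "/tmp/stapXY9W47/stap_482377c0eab312aca9dcd34efac51d67_2620_src.c" using 225668virt/44400res/4276shr/40556data kb, in 50usr/220sys/268real ms.
    Pass 4: compiled C into "stap_482377c0eab312aca9dcd34efac51d67_2620.ko" in 5000usr/1120sys/6094real ms.
    Pass 5: starting run.
    iSCIS initiator login: InitiatorName=iqn.1991-05.com.microsoft:dell-pc

The output shows the initiator's IQN name printed on the target side. stap works on applications too. Taking the ./test program above as the example again, the following script measures the time spent in f1 while test runs:

    [root@localhost ~]# cat test.stp
    # the path of the test binary below must match your system
    #
    # usage: stap test.stp -v -x `pidof test`
    global f1, f1_intervals;

    # record the entry timestamp of f1, per thread
    probe process("/root/test").function("f1") {
        f1[tid()] = gettimeofday_us()
    }

    # on return, add the elapsed time to the statistics
    probe process("/root/test").function("f1").return {
        t = gettimeofday_us()
        old_t = f1[tid()]
        if (old_t) f1_intervals <<< t - old_t
        delete f1[tid()]
    }

    # print the summary and a log-scale histogram when the script exits
    probe end {
        if (@count(f1_intervals)) {
            printf("f1 latency min:%dus avg:%dus max:%dus count:%d\n",
                   @min(f1_intervals), @avg(f1_intervals),
                   @max(f1_intervals), @count(f1_intervals))
            print(@hist_log(f1_intervals));
        }
    }
    [root@localhost ~]# stap test.stp -v -x `pidof test`

[Figure: stap output for test.stp (wKioL1fsmmHBe-YCAABxgCeq_Mw379.png)]

stap measures with very high precision: the figure shows that a call to f1() takes between 1 us and 17 us.

4. Comparison of the three tools

Running the three profilers against test.c shows that gprof and perf suit batch profiling of every function in an application, whereas stap is fine-grained: it can easily measure the execution time of a single function or a group of functions, and can even modify variables in the program at run time. The three tools compare as follows:

| Tool | Mechanism | Front end | Back end | Time precision | Multithreading support | UI friendliness | Batch processing | Typical use |
|---|---|---|---|---|---|---|---|---|
| gprof | Build-time instrumentation (-pg) | None | gprof | Coarse, 0.01 s by default | Needs the extra gprof-helper library | Yes | Good | Application hot-spot analysis, time-share statistics |
| perf | Runtime sampling/instrumentation | perf record | perf report | Fine, CPU clock | Good | Fairly good | Good | System-wide hot-spot location, CPU performance analysis |
| stap | Runtime instrumentation | stap | systemtap | Fine, 1 us | Good | None | Poor | Timing specific modules and functions, precise tuning |

So in future optimization work, when faced with an unfamiliar system, we can first use perf (when no source code is available) or gprof (when it is) to locate the hot functions and modules quickly, and then use stap on those hot modules to measure how the time is distributed across specific functions. This top-down approach finds the critical spots quickly so that the right fix can be applied. For modules we already know well, or for known bottlenecks, we can go straight to stap, which is more efficient.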
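When SystemTap is not available on a machine, the per-call figures that test.stp reports for f1 (min, avg, max, count) can be roughly cross-checked by timing the calls by hand. The sketch below is only an illustration under stated assumptions: unlike stap and perf it requires editing the source, and the names latency_stats, now_ns, and measure_f1 are hypothetical helpers, not part of gprof, perf, or SystemTap. It reuses the f1/f2/f3 workload from test.c and times each call with clock_gettime(CLOCK_MONOTONIC).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Same workload as test.c in section 1. */
static unsigned long f3(unsigned long cnt) { return cnt*cnt*cnt*cnt*cnt*cnt; }
static unsigned long f2(unsigned long cnt) { return f3(cnt) ^ f3(cnt + 1); }
static unsigned long f1(unsigned long cnt)
{
    if (cnt == 0) return 0;
    if (cnt == 1) return 1;
    return f2(cnt - 1) ^ f2(cnt - 2);
}

/* Hypothetical helper type (an assumption of this sketch). */
struct latency_stats {
    uint64_t min_ns, max_ns, total_ns, count;
};

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Call f1(0) .. f1(calls-1), timing each call individually. */
static struct latency_stats measure_f1(unsigned long calls)
{
    struct latency_stats s = { UINT64_MAX, 0, 0, 0 };
    volatile unsigned long sink = 0;  /* keeps the calls from being optimized away */

    assert(calls > 0);
    for (unsigned long i = 0; i < calls; i++) {
        uint64_t start = now_ns();
        sink ^= f1(i);
        uint64_t delta = now_ns() - start;

        if (delta < s.min_ns) s.min_ns = delta;
        if (delta > s.max_ns) s.max_ns = delta;
        s.total_ns += delta;
        s.count++;
    }
    (void)sink;
    return s;
}

#ifdef LATENCY_DEMO_MAIN
int main(void)
{
    struct latency_stats s = measure_f1(40000);
    printf("f1 latency min:%lluns avg:%lluns max:%lluns count:%llu\n",
           (unsigned long long)s.min_ns,
           (unsigned long long)(s.total_ns / s.count),
           (unsigned long long)s.max_ns,
           (unsigned long long)s.count);
    return 0;
}
#endif
```

Compiling with something like gcc -O0 -DLATENCY_DEMO_MAIN -o f1_timing f1_timing.c (the file name is arbitrary) prints one summary line. The numbers include the cost of the two clock_gettime calls and exclude the printf() that test.c performs, so they are not expected to match the stap histogram exactly.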