Performance Tools Comparison Table

I am evaluating performance tools for my own use. If you see a wrong entry, please let me know and I’ll be happy to correct it!

Each tool is rated on: number of pthreads, number of forks, heap usage, stack usage, performance effect, extra notes, MPI compatibility, need for recompiling, portability on Linux, cache misses, and CPU time.

strace
Number of pthreads: Yes
Number of forks: Yes
Heap usage: Yes, through `brk`
Stack usage: Yes, through `brk`
Performance effect: Significant. iallreduce jumped from 79.99 us to 238.795 us, roughly a 3x slowdown.
MPI compatibility: Yes
Need recompiling: No
Portable on Linux: Yes
Cache misses: No
CPU time: No

perf
Number of pthreads: Indirectly, by following the threads (similar to the custom assembly parser)
Number of forks: Indirectly, by following the forks (similar to the custom assembly parser)
Heap usage: Yes
Stack usage: Yes, but the program needs to be compiled with a specific flag. See Stack Traces Link.
Extra notes: It can also measure the number of context switches. Check this link for all capabilities.
MPI compatibility: Yes
Need recompiling: No
Portable on Linux: No, depends on the CPU registers.
Cache misses: Yes
CPU time: Yes

valgrind
Number of pthreads: In theory yes, but I couldn’t achieve it in practice
Number of forks: Yes, through the results (number of output headers)
Heap usage: Yes
Stack usage: It reports results, but they were wrong. I tested 110 times, and it couldn’t measure the stack size. Also, the measurements are based on snapshots, so maybe it just skips the part where the stack is used.
Performance effect: Significant.
Extra notes: Official documentation: “However, the simulations are basic and unlikely to reflect the behaviour of a modern machine”
MPI compatibility: Yes
Need recompiling: No
Portable on Linux: Yes
Cache misses: Yes
CPU time: Yes

gdb
Number of pthreads: Yes, with a script.
Number of forks: Yes, with a script.
Heap usage: Yes
Stack usage: Yes
Remaining columns: Yes Yes No No No

gperftools
Number of pthreads: No (couldn’t find)
Number of forks: Indirectly yes, it creates files for every fork.
Heap usage: Yes.
Stack usage: No (couldn’t find)
Performance effect: It runs a stop-the-world sampler. In other words, it periodically stops the program being profiled to collect information.
Extra notes: libtcmalloc raises an error with large allocations.
MPI compatibility: Yes
Need recompiling: Yes
Portable on Linux: ?
Cache misses: ?
CPU time: ?

pmap
Number of pthreads, number of forks, heap usage, stack usage, performance effect: See extra notes
Extra notes: I thought this might be useful. However, the problem is that we cannot use this tool from within the running program, only externally. Therefore, it becomes impractical.
Remaining columns: ? ? ? ?

kokkos-tools
Number of pthreads: No
Number of forks: No info in repo/issues/ChatGPT
Heap usage: Didn’t measure malloc.
Stack usage: Didn’t measure local variables.
Performance effect: Very low, according to the repository documentation.
Extra notes: I could only find an example which uses instrumented code. I don’t know if we can make this work with manual code instrumentation. Also, in my experiments, I could only measure the memory of the Kokkos calls, not others. For example, heap memory is not included in the results.
Remaining columns: ? ? ? ?

Custom Assembly Parser
Number of pthreads: Yes
Number of forks: Yes
Heap usage: Yes
Stack usage: Yes
Performance effect: None
Extra notes: We can write an assembly parser that virtually executes all instructions and reports the performance bottleneck. However, the results will differ from real-world experiments due to the complex nature of CPUs. (But could we use this if we had a CPU simulator? Or maybe some machine learning, e.g. we generate lots of assembly and learn a model that estimates the performance based on this assembly?)
Remaining columns: ? ? ? ?

ftrace
Performance effect: Sometimes up to 5x, as stated in this link
Remaining columns: ? ? ? ?
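
To make the strace entries more concrete, here is a minimal Python sketch of how the thread, fork and `brk` information could be pulled out of a saved trace. It assumes the trace was written with something like `strace -f -o trace.log ./program`. The function name `summarize_strace` and the sample log are made up for illustration, and the sample lines are hand-written, not real measurements. Since large allocations may be served through `mmap` instead of `brk`, the `brk` growth should be read as a rough lower bound on heap usage.

```python
import re

# One completed syscall line, with or without the pid prefix that `strace -f` adds.
# Lines split into "<unfinished ...>" / "<... resumed>" are NOT handled in this sketch.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+)"
)


def summarize_strace(log_text):
    """Count thread and process creations and program-break growth in strace output.

    Hypothetical helper: brk values from all traced pids are pooled, so the
    growth figure is only meaningful for a single-process trace.
    """
    threads = 0
    forks = 0
    brk_values = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # signals, exit messages, split calls
        name, args, ret = m["name"], m["args"], m["ret"]
        if name in ("clone", "clone3") and ret.isdigit():
            # Threads are created with CLONE_THREAD; other clones are fork-like.
            if "CLONE_THREAD" in args:
                threads += 1
            else:
                forks += 1
        elif name in ("fork", "vfork") and ret.isdigit():
            forks += 1
        elif name == "brk" and ret.startswith("0x"):
            brk_values.append(int(ret, 16))
    growth = max(brk_values) - brk_values[0] if brk_values else 0
    return {"threads": threads, "forks": forks, "brk_growth_bytes": growth}


# Hand-written sample in strace's output style (illustrative only).
SAMPLE_LOG = """\
brk(NULL) = 0x1000000
brk(0x1021000) = 0x1021000
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0000000a10) = 4242
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS, exit_signal=0, stack=0x7f0000000000, stack_size=0x7fff00, tls=0x7f0000001640}, 88) = 4243
[pid  4242] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0000000a10) = 4244
"""

if __name__ == "__main__":
    print(summarize_strace(SAMPLE_LOG))
```

Counting from a saved log keeps the analysis separate from the traced run, so the slowdown noted for strace is paid only once while tracing.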

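On the pmap limitation (it can only be used externally): on Linux, a program can read some of the same kernel-provided information about itself from `/proc/self/status`. This is a different technique from the tools compared above and was not part of my evaluation, so treat the following as a rough sketch. The helper names `parse_proc_status` and `read_own_status` are made up for illustration. `VmData` covers the whole data segment rather than only `malloc` memory, and `VmStk` is the stack of the main thread, so both are approximations.

```python
def parse_proc_status(text):
    """Extract thread count and memory sizes (in kB) from /proc/<pid>/status text."""
    wanted = {"Threads": "threads", "VmData": "data_kb", "VmStk": "stack_kb"}
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            # Values look like "\t    1024 kB" or "\t4"; keep the leading number.
            out[wanted[key]] = int(value.split()[0])
    return out


def read_own_status(path="/proc/self/status"):
    """Read the status of the calling process (Linux only)."""
    with open(path) as f:
        return parse_proc_status(f.read())


# Hand-written sample in the /proc/<pid>/status layout (illustrative only).
SAMPLE_STATUS = "Name:\tdemo\nVmData:\t    1024 kB\nVmStk:\t     132 kB\nThreads:\t4\n"

if __name__ == "__main__":
    print(parse_proc_status(SAMPLE_STATUS))
```

Calling `read_own_status()` at a few points inside the program gives snapshots, so short-lived peaks between two calls can still be missed.
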