As you probably know, I will be adding some performance counters and enhancing the profiling view of Apitrace this GSoC summer. Recently I have been experimenting with existing similar software, and here are some of my notes. This is certainly not a detailed review; I focused primarily on profiling and on what could be applied to Apitrace.
I had a look at Intel GPA (as a part of Intel INDE ’15), AMD PerfStudio (3.2.18.0), NVIDIA Nsight (4.6) and CUDA Visual Profiler (7.0). Sadly enough, not all of these tools are available on Linux (on Linux you have Visual Profiler for CUDA and AMD CodeXL for OpenCL; that is all I have found), and not all of them support OpenGL profiling. Do not be surprised by the look of the Windows UI and the DirectX API names.
1. Intel GPA (DirectX): Platform Analyzer + Frame Analyzer
Currently supported APIs include DirectX and OpenGL ES (with a Linux toolkit available), but not OpenGL. I was also told that support for the OpenGL Core profile is being developed.
The toolset consists of several applications: Graphics Monitor, System Analyzer, Platform Analyzer and Frame Analyzer. Graphics Monitor is simply used to display a HUD and collect data (a one-frame snapshot or a trace of some fixed duration) from applications, either by setting up triggers or manually with hotkeys (which almost never work, as the analyzed applications trap all keystrokes). Also note that there is no replay functionality (as in Apitrace): all performance data is collected per frame (there is actually a more fine-grained option), and the timeline is built at the same time, while the application is running.

In the options (screen above) you are limited to selecting only 4 counters (probably because more would not fit into the HUD), but you can extend this number by using System Analyzer. This tool connects to your current session and displays performance graphs in real time. You can have arbitrarily many of them, and all the data displayed is also included in the trace capture file when you save one.

Trace files can be opened in Platform Analyzer. This tool is all about timelines. Among others there are CPU/GPU frame bounds, GPU usage and DirectX tasks by thread. What is interesting here is that GPU metrics are displayed as separate graphs in the timeline alongside the other data (screens below). I think this is a nice visualization and it could be implemented in Apitrace, though I am not sure whether per-call metrics can be displayed in such a way.


By the way, you can have a look at what metrics are supported here: https://software.intel.com/sites/products/documentation/gpa/15.1/win/Metrics_List_for_Intel_Graphics_Performance_Analyzers.htm
Also there is a filtering option. However, it didn’t work as I expected.

I thought it would filter out everything but the objects of the selected type. So, for example, you could filter by a specific call (or by call type, like draw calls) and easily find those calls in the timeline. This could also be useful in Apitrace. In reality, this option grays out everything except the selected object (and the objects connected to it, its neighbors in the timeline) and recalculates some numbers in the statistics pane (the bottom one) accordingly.
Frame Analyzer offers many options, but I will focus on profiling. The tool gives the ability to analyze both individual calls (ergs in GPA terminology) and groups of calls (grouped by render target). Metrics collection works differently for these two kinds of tasks (details here https://software.intel.com/sites/products/documentation/gpa/15.1/win/Analyzing_the_Time_for_Individual_Ergs_and_the_Time_for_a_Group_of_Ergs.htm). Also, the set of available metrics differs from the one used in traces (I had pretty much nothing on my Intel HD 3000). Initially I was not even thinking about Apitrace as an individual frame analyzer. I do not even know if this is actually possible to implement in the current state. Anyway, I do not think I will deal with this within the scope of my project. This could be implemented later on the basis of what is already done. For those who are interested: a project to create a frame-analysis tool based on Apitrace was recently proposed https://github.com/janesma/apitrace/wiki/frameretrace-branch, and it is being discussed quite actively on the mailing list.
There is, however, a feature that can be brought to Apitrace profiling. When you change something (like shader code, DirectX states etc.), you can reprofile and see the difference in the numbers.

A similar thing can be done in Apitrace (for the cases when you change call parameters in the main interface), but I do not know whether long profiling times would leave it usable.
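To make the "reprofile and see the difference" idea a bit more concrete, here is a minimal Python sketch of the comparison step. It is not Apitrace code: the function name, the data layout (a dict from call number to GPU time in nanoseconds) and the sample numbers are all my assumptions for illustration.

```python
def profile_diff(before, after):
    """Compare two profiles given as {call_no: gpu_time_ns} dicts.

    Hypothetical layout, not Apitrace's. Only calls present in both
    profiles are compared; the result maps call_no -> (before, after, delta).
    """
    diff = {}
    for call_no in sorted(before.keys() & after.keys()):
        b, a = before[call_no], after[call_no]
        diff[call_no] = (b, a, a - b)
    return diff


# Made-up numbers: the same three calls profiled before and after an edit.
before = {101: 1200, 102: 800, 103: 450}
after = {101: 900, 102: 800, 103: 500}

for call_no, (b, a, delta) in profile_diff(before, after).items():
    print(f"call {call_no}: {b} ns -> {a} ns ({delta:+d} ns)")
```

The expensive part in practice would be producing the second profile, not comparing it, which is exactly the usability concern above.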
2. AMD GPU PerfStudio (OpenGL)
PerfStudio is a frame debugger/profiler. You connect your application to the PerfStudio server and then work with it. In terms of profiling, PerfStudio allows you to capture per-draw metrics after stopping the application at a specific frame.

Counters are divided into groups. I do not know yet how to group the counters I am going to implement for Apitrace (AMD_performance_monitor counters, for example, have internal groups), but this idea should definitely be considered. As soon as you select counters, profiling takes place. It has to be done in several passes, which suggests that some metrics are incompatible and cannot be collected simultaneously (am I missing something here?). Finally, the table of metrics data is displayed:

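One plausible reason for the several passes is a limit on how many counters of a group can be active at once (AMD_performance_monitor, for instance, reports a maximum number of active counters per group). I do not know how PerfStudio actually schedules its passes, so the Python sketch below is only my guess at the idea: it splits the selected counters into passes so that no pass exceeds the per-group limit. The group names, counter names and limits are hypothetical.

```python
from collections import defaultdict


def plan_passes(selected, max_active):
    """Split selected counters into profiling passes.

    selected:   list of (group, counter) pairs chosen by the user
    max_active: {group: max counters of that group active at once}, each >= 1
    Returns a list of passes, each a list of (group, counter) pairs.
    """
    by_group = defaultdict(list)
    for group, counter in selected:
        by_group[group].append(counter)

    passes = []
    for group, counters in by_group.items():
        limit = max_active[group]
        # Chunk i of this group goes into pass i, shared with other groups.
        for start in range(0, len(counters), limit):
            index = start // limit
            while len(passes) <= index:
                passes.append([])
            chunk = counters[start:start + limit]
            passes[index].extend((group, c) for c in chunk)
    return passes


# Hypothetical groups, counters and limits.
selected = [("Timing", "GPUTime"), ("Shader", "ALUBusy"),
            ("Shader", "FetchBusy"), ("Shader", "Stalled")]
max_active = {"Timing": 1, "Shader": 2}

for number, counters in enumerate(plan_passes(selected, max_active)):
    print(f"pass {number}: {counters}")
```

With these made-up limits the three "Shader" counters do not fit into one pass, so the frame has to be replayed twice.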
Across the top of the Data tab you can group draw calls by the so-called state bucket. This means you can group the draw calls which share the same shader, render into the same render target (or depth or stencil buffer) etc. For these grouped calls the data is aggregated (percentages are averaged). It looks like this:

Currently Apitrace groups calls by shader program, if I understand it right. Implementing other state buckets (FBO, stencil, depth) might be a good idea if it is possible at all (i.e. is this information available in Apitrace, and how consistent is it across many frames?).
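The state-bucket aggregation described above can be sketched roughly as follows. This is only an illustration in Python, not how PerfStudio or Apitrace does it: the record layout, the metric names and the rule "sum absolute values, average percentages" are my assumptions based on what the UI appears to show.

```python
from collections import defaultdict


def group_by_bucket(draw_calls, key, percent_metrics=()):
    """Aggregate per-draw metrics over calls sharing the same state.

    draw_calls:      list of dicts with the state fields and a "metrics" dict
    key:             state field to bucket by, e.g. "program" or "fbo"
    percent_metrics: metric names that are averaged instead of summed
    """
    groups = defaultdict(list)
    for call in draw_calls:
        groups[call[key]].append(call["metrics"])

    result = {}
    for bucket, rows in groups.items():
        aggregated = {}
        for name in rows[0]:
            total = sum(row[name] for row in rows)
            if name in percent_metrics:
                aggregated[name] = total / len(rows)  # average percentages
            else:
                aggregated[name] = total              # sum absolute values
        result[bucket] = aggregated
    return result


# Made-up draw calls; every field name here is hypothetical.
draw_calls = [
    {"call": 10, "program": 3, "fbo": 1,
     "metrics": {"gpu_time_ns": 100, "alu_busy_pct": 50.0}},
    {"call": 11, "program": 3, "fbo": 2,
     "metrics": {"gpu_time_ns": 300, "alu_busy_pct": 70.0}},
    {"call": 12, "program": 5, "fbo": 1,
     "metrics": {"gpu_time_ns": 200, "alu_busy_pct": 20.0}},
]

by_program = group_by_bucket(draw_calls, "program", {"alu_busy_pct"})
by_fbo = group_by_bucket(draw_calls, "fbo", {"alu_busy_pct"})
print(by_program)
print(by_fbo)
```

The same function serves every bucket type; the open question for Apitrace is whether the state fields (bound FBO, depth and stencil attachments) can be recorded per draw call in the first place.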
In the options you can hide data from chosen counter groups:

Might be useful.
PerfStudio also has an option for comparing two profiles:

3. NVIDIA Nsight (OpenGL)
Nsight for Windows is shipped as a Visual Studio plugin. There is also a Linux version for Eclipse, but it is a part of the CUDA Toolkit and supports only the CUDA API. The Windows version refused to run an OpenGL 3.3 application; it looks like it only supports OpenGL >= 4.2.
Nsight is a frame debugger/profiler (like PerfStudio). A well-written description of its profiling capabilities can be found here: http://docs.nvidia.com/gameworks/index.html#developertools/desktop/frame_profiler_ogl.htm
It is worth noting that Nsight also uses state buckets, and that the PerfKit counters allow it to draw some pretty charts. I think such things would be too much for Apitrace (they are also PerfKit-specific), but there should be some option to export the data (then you can plot whatever you like).
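As an illustration of what such an export could look like, here is a small Python sketch that writes per-call metrics as CSV, which most plotting tools can read. The column names and the sample rows are hypothetical; Apitrace's actual output format is not decided here.

```python
import csv
import io


def export_csv(rows, out):
    """Write per-call metric rows (dicts with a "call" key) as CSV to `out`."""
    metric_names = sorted({name for row in rows for name in row} - {"call"})
    writer = csv.DictWriter(out, fieldnames=["call"] + metric_names, restval="")
    writer.writeheader()
    writer.writerows(rows)  # metrics missing from a row are left empty


# Made-up data; a real exporter would write to a file instead of a buffer.
rows = [
    {"call": 10, "gpu_time_ns": 100, "samples_passed": 5},
    {"call": 11, "gpu_time_ns": 300},
]
buf = io.StringIO()
export_csv(rows, buf)
print(buf.getvalue())
```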

4. NVIDIA Visual Profiler (CUDA)
Visual Profiler is a part of the CUDA Toolkit and you can run it under Linux. It is similar to Apitrace in the sense that it also replays the whole timeline for profiling, and that takes forever. The workflow is simple: you generate a timeline, you collect metrics, you analyze your app. I was not able to find anything here worth applying to Apitrace.

