After testing the AMD uProf tool, I have to say it's the best solution so far for a 64-bit OS (it does not work on any 32-bit platform).
Its GUI looks and feels like a web page that guides you through selecting the application, setting the source and symbol directories, and configuring the profiler. The defaults are OK for a basic check.
A good thing is that you can run the app from the profiler, and even set a delay to skip long initialization and have the application closed automatically.
Saving a profile lets you switch between the different applications you want to test.
The GUI has a good structure with basic use case oriented tabs. I find it very intuitive.
When you run the profiler, it starts the application and displays a counter. After you stop it, it analyzes the data and, in the most minimal case, shows:
* a tree view with the application and
** all modules
** all threads
* a lower window with all the functions called by the selected item in the tree
* different columns depending on the setup. In my case it was CPU timing only.
In my test case I compiled a DLL with profiling and debug symbols enabled. I pointed the profiler to the directory with the sources and, as the symbol directory, selected the directory where C::B built the target DLL.
The result gave me what VerySleepy did not:
* All modules of the application were listed by name in the tree
* Each module showed its total CPU time, which is easy to compare with the total time of higher-order modules or the application
** All system dependencies of each module are listed with timing
* The application function calls were IDs only (same as VerySleepy)
* The DLL functions were listed by name with timing in [ms]. These were wxWidgets calls and system calls.
Example:
wxMBConv::FromWChar(char*, unsigned int, wchar_t const*, unsigned int) const 0.009
LeaveCriticalSection 0.003
The result showed that function calls within my DLL were not listed, so the "debug symbols" were either not generated or stored elsewhere. The documentation says it needs *.pdb files on Windows. I have not yet figured out how to get gcc to produce PDB files for my own code, but that should not be too difficult to find out.
All in all, great results for a quick and dirty test. Even if you use it only sparingly, it's intuitive enough to give useful insights into total timing and where the time is spent. By inserting "flag pole" system calls in certain functions you could even do the profiling analogue of "printf debugging" with this tool. If you have nothing better, this is definitely a recommendation.
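To illustrate the "flag pole" idea, here is a minimal sketch of what I have in mind. It is not taken from the uProf docs. The names flag_pole, g_flag_pole_hits and sum_of_squares are made up for this example, and the choice of GetTickCount as the marker call is only an assumption (any cheap system call you do not otherwise use should do). Whether the marker actually shows up in the profile depends on samples landing in it, so it is more likely to be visible in hot paths.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

// Bookkeeping for this sketch only: counts how often the marker was hit.
static std::uint64_t g_flag_pole_hits = 0;

// Hypothetical marker helper (not part of any profiler API).
// It calls a cheap system function that is resolved by name even when
// the symbols of our own module are missing.
#if defined(__GNUC__)
__attribute__((noinline))
#endif
void flag_pole() {
#ifdef _WIN32
    (void)::GetTickCount();  // exported by kernel32.dll
#else
    (void)::getppid();       // stand-in so the sketch also builds elsewhere
#endif
    ++g_flag_pole_hits;
}

// Hypothetical function under investigation.
long long sum_of_squares(int n) {
    assert(n >= 0);
    long long total = 0;
    for (int i = 1; i <= n; ++i) {
        flag_pole();  // marker call placed inside the code path of interest
        total += static_cast<long long>(i) * i;
    }
    return total;
}
```

Calling sum_of_squares from the DLL would then trigger one marker call per loop iteration. Using a different marker call per function would make it possible to tell the functions apart in the dependency list.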
Thanks BlueHazzard for the recommendation.