Author Topic: Severe performance issue regarding the stack  (Read 6208 times)

Balazs

  • Guest
Severe performance issue regarding the stack
« on: March 18, 2006, 10:32:32 pm »
Hi!

Not long ago I was working on a project that does a huge amount of calculation. While testing it with a small console app, I ran into an extremely weird performance issue.
The test app is here:

Code
#include "cnn.h"
#include <stdio.h>
#include <time.h>

int main()
{
CNN cnn;

// --- Initialize ---
cnn.A.SetSize(3,3);
cnn.B.SetSize(3,3);
cnn.SetSize(640,480);

cnn.A(0,0) = 0; cnn.A(0,1) = 0; cnn.A(0,2) = 0;
cnn.A(1,0) = 1; cnn.A(1,1) = 4; cnn.A(1,2) = 2;
cnn.A(2,0) = 0; cnn.A(2,1) = 0; cnn.A(2,2) = 0;

cnn.B(0,0) = 1; cnn.B(0,1) = 4; cnn.B(0,2) = 7;
cnn.B(1,0) = 2; cnn.B(1,1) = 5; cnn.B(1,2) = 8;
cnn.B(2,0) = 3; cnn.B(2,1) = 6; cnn.B(2,2) = 9;

cnn.Check_A();

// --- Measure performance ---
clock_t t = clock();
cnn.maxiter = 100;
cnn.Process();
printf("%ld\n", (long)(clock() - t));   // elapsed CPU time in clock ticks

return 0;
}

All it does is measure the performance of the cnn.Process() method.
This method accesses its own object a LOT: member variables, member structures, etc.
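
A minimal sketch of the same measurement wrapped in a reusable helper. The names cpu_ms and dummy_work are hypothetical and not part of the test package; dummy_work only stands in for cnn.Process() so the sketch compiles without cnn.h. It assumes a C++11 compiler.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: runs `work` once and returns the CPU time it used,
// in milliseconds.
template <typename Work>
double cpu_ms(Work work)
{
    const std::clock_t start = std::clock();
    assert(start != static_cast<std::clock_t>(-1)); // clock() reports failure as -1
    work();
    const std::clock_t stop = std::clock();
    // clock() counts implementation-defined ticks; CLOCKS_PER_SEC converts to seconds.
    return 1000.0 * static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}

// Stand-in workload (assumption): sums i * 0.5 for i in [0, n).
inline double dummy_work(int n)
{
    volatile double sum = 0.0; // volatile keeps the loop from being optimized away
    for (int i = 0; i < n; ++i) sum = sum + i * 0.5;
    return sum;
}
```

In the test app this could be called as cpu_ms([&] { cnn.Process(); }). Reporting milliseconds instead of raw ticks makes numbers comparable between compilers whose CLOCKS_PER_SEC differs.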

I was experimenting with different routines behind the same interface, so the test console app did not need to be modified. I noticed that a supposedly faster algorithm actually ran SLOWER. At first I thought the new algorithm was just bad, BUT THAT WAS NOT THE CASE!

After a VERY long time, I was finally able to track things down, and I realized that it is some sort of ALIGNMENT issue INSIDE THE TEST APP, and NOT in the Process() method!

If I changed the part before the initialization step above to the following:
Code
#include "cnn.h"
#include <stdio.h>
#include <time.h>

CNN cnn;

int main()
{

// --- Initialize ---
then my new algorithm was faster, which I expected in the first place.

What I modified was this: I moved the declaration of cnn OUTSIDE of the main() function, which solved the issue.

Now I would like to get some answers to this, as it is really frustrating me: how can a local variable inside main() make object accesses so terribly slow? When I moved the declaration outside main(), the program became (no joke) 20 TIMES faster! The problem also goes away if I allocate the object with new, like this:

Code
#include "cnn.h"
#include <stdio.h>
#include <time.h>

int main()
{
CNN* c = new CNN();
CNN& cnn = *c;
//...

The problem only arises when cnn is a local variable inside main() (allocated on the stack).

The second weird thing is that when I used a slightly different CNN class (which had a few extra member variables), the problem also went away. That makes me think this is some sort of alignment issue, yet I cannot see why.

I have uploaded a package of this weird thing, you can see for yourself:
http://digitus.itk.ppke.hu/~oroba/test.zip
Just type make (or mingw32-make), and you get 2 executables:
test_fast.exe
test_slow.exe
Run them from the console, and you'll see what I mean. The only difference between the 2 programs is what I described above, see main_fast.cpp and main_slow.cpp.

I also included cnn2.h as an example of what happens when the class is slightly modified. Rename cnn2.h to cnn.h, and you'll see that the program gets faster in the "test_slow.exe" case.

The difference between cnn.h and cnn2.h is PURELY the class SIZE (number of members), nothing else!

If you have the time, try this experiment. It may also affect YOUR projects, and I think it may well surprise you, as it certainly surprised me. It may cause serious performance issues.

As I do not yet know why this happens, any ideas would be welcome.

--
Greets,
  Balázs

DJMaze

  • Guest
Re: Severe performance issue regarding the stack
« Reply #1 on: March 19, 2006, 05:48:37 pm »
By default, each Windows thread has 1 MB of stack available.

Using "CNN* c = new CNN();" you're working on the heap.

In C++, creating objects on the stack is fast because when you enter a particular scope the stack pointer is moved down once to allocate storage for all the stack-based objects created in that scope, and when you leave the scope (after all the local destructors have been called) the stack pointer is moved up once. However, creating heap objects in C++ is typically much slower because it's based on the C concept of a heap as a big pool of memory that (and this is essential) must be recycled. When you call delete in C++ the released memory leaves a hole in the heap, so when you call new, the storage allocation mechanism must go seeking to try to fit the storage for your object into any existing holes in the heap or else you'll rapidly run out of heap storage. Searching for available pieces of memory is the reason that allocating heap storage has such a performance impact in C++, so it's far faster to create stack-based objects.
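
A small sketch of the two storage kinds described above, assuming a C++11 compiler. Tracer and lifetime_demo are hypothetical names introduced only for illustration. Automatic objects are destroyed when the scope ends, in reverse order of construction; the heap object lives until it is deleted, which std::unique_ptr does here at scope exit.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical tracer: logs a lowercase letter on construction and the
// matching uppercase letter on destruction.
struct Tracer {
    std::string& log;
    char id;
    Tracer(std::string& l, char c) : log(l), id(c) { log += id; }
    ~Tracer() { log += static_cast<char>(id - 'a' + 'A'); }
};

std::string lifetime_demo()
{
    std::string log;
    {
        Tracer a(log, 'a');                               // automatic (stack) storage
        std::unique_ptr<Tracer> b(new Tracer(log, 'b'));  // dynamic (heap) storage
        Tracer c(log, 'c');                               // automatic (stack) storage
        assert(log == "abc");                             // all three constructed
    } // scope exit: c destroyed, then b's owner deletes the heap object, then a
    return log;
}
```

The sketch only shows lifetimes, not timings; how much slower a heap allocation is depends on the allocator and is not measured here.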

However, accessing a stack object is slower than accessing a heap object inside a thread.
main() is the main threading object inside a process, and every stack object inside a thread is checked for access privileges on an NT machine (think about CreateThread()). So moving your CNN object onto the process stack gives the privileges to all threads instead of just the main thread.
This could become an issue in multi-threaded applications, since all thread objects have access to CNN and race conditions might occur (not thread safe).

I'm not 100% sure here, maybe some guru could explain it further.

DJMaze

  • Guest
Re: Severe performance issue regarding the stack
« Reply #2 on: March 19, 2006, 06:09:20 pm »
See:
Code
int main()
{
clock_t t = clock();
CNN cnn;            // This makes it slow

cnn.A.SetSize(3,3);
cnn.B.SetSize(3,3);
cnn.SetSize(640,480);

cnn.A(0,0) = 0; cnn.A(0,1) = 0; cnn.A(0,2) = 0;
cnn.A(1,0) = 1; cnn.A(1,1) = 4; cnn.A(1,2) = 2;
cnn.A(2,0) = 0; cnn.A(2,1) = 0; cnn.A(2,2) = 0;

cnn.B(0,0) = 1; cnn.B(0,1) = 4; cnn.B(0,2) = 7;
cnn.B(1,0) = 2; cnn.B(1,1) = 5; cnn.B(1,2) = 8;
cnn.B(2,0) = 3; cnn.B(2,1) = 6; cnn.B(2,2) = 9;

cnn.Check_A();

cnn.maxiter = 100;
cnn.Process();
printf("%ld\n", (long)(clock() - t));

return 0;
}

Code
int main()
{
clock_t t = clock();
CNN* cnn = new CNN(); // This also makes it fast

cnn->A.SetSize(3,3);
cnn->B.SetSize(3,3);
cnn->SetSize(640,480);

cnn->A(0,0) = 0; cnn->A(0,1) = 0; cnn->A(0,2) = 0;
cnn->A(1,0) = 1; cnn->A(1,1) = 4; cnn->A(1,2) = 2;
cnn->A(2,0) = 0; cnn->A(2,1) = 0; cnn->A(2,2) = 0;

cnn->B(0,0) = 1; cnn->B(0,1) = 4; cnn->B(0,2) = 7;
cnn->B(1,0) = 2; cnn->B(1,1) = 5; cnn->B(1,2) = 8;
cnn->B(2,0) = 3; cnn->B(2,1) = 6; cnn->B(2,2) = 9;

cnn->Check_A();

cnn->maxiter = 100;
cnn->Process();
printf("%ld\n", (long)(clock() - t));

delete cnn;           // release the heap object
return 0;
}
NOTE: I haven't tested this, so you may have to modify it a bit.

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9694
Re: Severe performance issue regarding the stack
« Reply #3 on: March 19, 2006, 07:05:49 pm »
I wonder how this is related to the Code::Blocks application / development?
Don't get me wrong, but I guess if you want to have "experts" answering your question you should post to a C++ board or in the news at e.g. comp.lang.c++.
With regards, Morten.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Severe performance issue regarding the stack
« Reply #4 on: March 19, 2006, 07:22:39 pm »
Quote
However, accessing a stack object is slower than accessing a heap object inside a thread.
main() is the main threading object inside a process and every stack object inside a thread is checked on access privileges on a NT machine (think about CreateThread()).
Are you even certain about that?
I cannot imagine how this checking could be implemented, and it does not make any sense to me either, since privileges are bound to the process.

In fact, use of stack objects is often encouraged not only because creation and destruction are an order of magnitude faster, but also because cache locality (and thus access time) is a lot better (especially for smaller objects).
I would be really surprised if stack allocation proved to be that much slower, the only possible pitfall with it being data member alignment.
Operator new guarantees proper alignment for an object of the allocated type. The stack does not guarantee any alignment, but most compilers will align to something like sizeof(void*) nevertheless. So unless you use PODs larger than that (long long), access should be at least as fast, and sometimes faster.
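
One way to test the alignment theory directly is to print or check the addresses. The sketch below assumes a C++11 compiler (for alignof) and a platform that provides std::uintptr_t; Sample is a hypothetical stand-in for a class like CNN. As far as I know, a conforming compiler places automatic objects at addresses that satisfy alignof for their type, so both checks would be expected to pass.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a class with mixed member sizes.
struct Sample {
    char   flag;
    double weights[9];
    int    maxiter;
};

// True if the address in p is a multiple of the alignment required for T.
template <typename T>
bool is_aligned(const T* p)
{
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Checks an object with automatic (stack) storage.
bool stack_object_is_aligned()
{
    Sample s = Sample();
    return is_aligned(&s) && is_aligned(&s.weights[0]);
}

// Checks an object allocated with operator new.
bool heap_object_is_aligned()
{
    std::unique_ptr<Sample> s(new Sample());
    return is_aligned(s.get()) && is_aligned(&s->weights[0]);
}
```

If both functions return true for the real class, misalignment is unlikely to explain a large slowdown, and the cause is worth looking for elsewhere.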
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Balazs

  • Guest
Re: Severe performance issue regarding the stack
« Reply #5 on: March 20, 2006, 09:06:26 pm »
Yes, it IS some alignment issue. If I change the class definition (simply by adding surplus int member variables), the running time also changes over quite a wide range (from very fast to very slow), but ONLY when the class is allocated on the stack; all other cases are not affected.

An even stranger thing is that Visual Studio 2003 Toolkit, GCC 3.4.5 and GCC 4.0.2 ALL produce the same results!

--
Greets,
Balázs
« Last Edit: March 20, 2006, 09:08:14 pm by Balazs »

Balazs

  • Guest
solved
« Reply #6 on: July 06, 2007, 09:16:06 pm »
I'm truly ashamed. :(

The whole problem was caused by uninitialized data! I didn't initialize one of the arrays of floats, and later an if() depended on its contents. After initializing it properly, the problem went away.

I'm sorry for the disturbance. I just wanted others reading this topic to know what the real cause of the problem was, so that it does not lead them to false assumptions.

So again: the problem has absolutely nothing to do with stacks or heaps. It was all my fault for not initializing data members properly.
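
A minimal sketch of this kind of bug and its fix, using a hypothetical Grid class rather than the real CNN. An object with static storage duration (a global) is zero-initialized before its constructor runs, while a local object's unassigned members start with indeterminate values, which would be consistent with the behaviour changing when the declaration was moved. Freshly allocated heap memory often happens to be zero as well, though nothing guarantees it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical class: only illustrates the initialization fix.
class Grid {
public:
    static const std::size_t kSize = 16;

    // Explicitly zero every element. Without this loop, a Grid with
    // automatic storage would start with indeterminate values in state[].
    Grid() : maxiter(0)
    {
        for (std::size_t i = 0; i < kSize; ++i) state[i] = 0.0f;
    }

    // Bounds-checked write access.
    void set(std::size_t i, float v)
    {
        assert(i < kSize);
        state[i] = v;
    }

    // Branches on the data, the way the if() mentioned above did.
    std::size_t count_active() const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < kSize; ++i)
            if (state[i] > 0.5f) ++n;
        return n;
    }

private:
    float state[kSize];
    int   maxiter;
};
```

With the constructor in place, the result no longer depends on where the object is stored. Compiler warnings such as GCC's -Wall, or a memory checker, can sometimes catch reads of uninitialized data, although not reliably for class members.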

--
Greets,
B.