Severe performance issue regarding the stack

Balazs:
Hi!

I was working on a project not so long ago, which did a huge amount of calculations. I was testing it with a small console app, when I bumped into an extremely weird performance issue.
The test app is here:

--- Code: ---#include "cnn.h"
#include <stdio.h>
#include <time.h>

int main()
{
CNN cnn;

// --- Initialize ---
cnn.A.SetSize(3,3);
cnn.B.SetSize(3,3);
cnn.SetSize(640,480);

cnn.A(0,0) = 0; cnn.A(0,1) = 0; cnn.A(0,2) = 0;
cnn.A(1,0) = 1; cnn.A(1,1) = 4; cnn.A(1,2) = 2;
cnn.A(2,0) = 0; cnn.A(2,1) = 0; cnn.A(2,2) = 0;

cnn.B(0,0) = 1; cnn.B(0,1) = 4; cnn.B(0,2) = 7;
cnn.B(1,0) = 2; cnn.B(1,1) = 5; cnn.B(1,2) = 8;
cnn.B(2,0) = 3; cnn.B(2,1) = 6; cnn.B(2,2) = 9;

cnn.Check_A();

// --- Measure performance ---
clock_t t = clock();
cnn.maxiter = 100;
cnn.Process();
printf("%ld\n", (long)(clock() - t));

return 0;
}

--- End code ---

All it does is measure the performance of the cnn.Process() method.
This method makes a LOT of accesses to the object itself: member variables, member structures, etc.
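For anyone who wants to repeat the measurement, here is a minimal sketch of a timing helper, assuming a C++11 compiler. The name time_ms is hypothetical and not part of the test package. It uses std::chrono::steady_clock (wall-clock time) instead of clock(), so its numbers are not directly comparable with the ones printed by the test app.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With the test app above, a call could look like: double ms = time_ms([&] { cnn.Process(); });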

I was experimenting with different routines behind the same interface, so the test console app did not need to be modified. I noticed that a supposedly faster algorithm actually runs SLOWER. At first I thought the new algorithm was just bad, BUT THAT WAS NOT THE CASE!

After a VERY long time, I was finally able to track things down, and I realized that it is some sort of ALIGNMENT issue INSIDE THE TEST APP, and NOT in the Process() method!

If I changed the part before the initialization step above to the following:

--- Code: ---#include "cnn.h"
#include <stdio.h>
#include <time.h>

CNN cnn;

int main()
{

// --- Initialize ---

--- End code ---
then my new algorithm was faster, which I expected in the first place.

What I modified was this: I moved the declaration of the cnn OUTSIDE of the main() function, which solved the issue.

Now I would like to get some answers to this, as it is really frustrating me: how can a local variable inside main() make object accesses so terribly slow? When I moved the declaration outside main(), the program became (no joke) 20 TIMES faster! The problem also goes away if I allocate the object with new, like this:

--- Code: ---#include "cnn.h"
#include <stdio.h>
#include <time.h>

int main()
{
CNN* c = new CNN();
CNN& cnn = *c;
//...

--- End code ---

The problem only arises when cnn is a local variable inside main() (allocated on the stack).

The second weird thing is that when I used a slightly different CNN class (one with a few extra member variables), the problem also went away. That makes me think this is some sort of alignment issue, yet I cannot figure out why.
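One way to probe the alignment suspicion is to print where the object ends up in each storage class. The following is only a sketch, assuming a C++11 compiler: Sample is a hypothetical stand-in for the real CNN class, and misalignment, report and compare_storage are illustrative names, none of them from the test package. Differing remainders would be consistent with an alignment problem, but would not prove it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the real CNN class from cnn.h.
struct Sample {
    double weights[9];
    int    maxiter;
};

// Distance of an address past the previous multiple of `boundary`
// (0 means the address is aligned to that boundary).
std::size_t misalignment(const void* p, std::size_t boundary) {
    assert(boundary != 0);
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % boundary);
}

// Prints an address and its remainder modulo 8 and 16.
void report(const char* label, const void* p) {
    std::printf("%-6s %p  mod 8 = %lu  mod 16 = %lu\n", label,
                const_cast<void*>(p),
                static_cast<unsigned long>(misalignment(p, 8)),
                static_cast<unsigned long>(misalignment(p, 16)));
}

Sample global_sample;  // static storage, like the variant with cnn outside main()

void compare_storage() {
    Sample local_sample;                 // automatic storage, like the slow variant
    Sample* heap_sample = new Sample();  // dynamic storage, like the new-based variant
    report("global", &global_sample);
    report("stack", &local_sample);
    report("heap", heap_sample);
    delete heap_sample;
}
```

Calling compare_storage() from main() in both the fast and the slow build, with Sample replaced by the real class, shows whether the three addresses differ in their remainders.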

I have uploaded a package of this weird thing, you can see for yourself:
http://digitus.itk.ppke.hu/~oroba/test.zip
Just type make (or mingw32-make), and you get 2 executables:
test_fast.exe
test_slow.exe
Run them from the console, and you'll see what I mean. The only difference between the 2 programs is what I described above, see main_fast.cpp and main_slow.cpp.

I also included cnn2.h as an example of what happens when the class is slightly modified. Rename cnn2.h to cnn.h, and you'll see that the program gets faster in the "test_slow.exe" case.

The difference between cnn.h and cnn2.h is PURELY the class SIZE (number of members), nothing else!

If you have the time, try this experiment. It may also affect YOUR projects, and it may well surprise you, as it surprised me. It may cause serious performance issues.

As I do not yet know why this happens, any ideas would be welcome.

--
Greets,
Balázs

DJMaze:
By default each Windows Thread has 1MB available on the stack.

Using "CNN* c = new CNN();" you're working on the heap.

In C++, creating objects on the stack is fast because when you enter a particular scope the stack pointer is moved down once to allocate storage for all the stack-based objects created in that scope, and when you leave the scope (after all the local destructors have been called) the stack pointer is moved up once. However, creating heap objects in C++ is typically much slower because it's based on the C concept of a heap as a big pool of memory that (and this is essential) must be recycled. When you call delete in C++ the released memory leaves a hole in the heap, so when you call new, the storage allocation mechanism must go seeking to try to fit the storage for your object into any existing holes in the heap or else you'll rapidly run out of heap storage. Searching for available pieces of memory is the reason that allocating heap storage has such a performance impact in C++, so it's far faster to create stack-based objects.

However, accessing a stack object is slower than accessing a heap object inside a thread.
main() is the main threading object inside a process, and every stack object inside a thread is checked for access privileges on an NT machine (think about CreateThread()). So moving your CNN object onto the process stack gives access to all threads instead of just the main thread.
This could become an issue in multi-threaded applications, since all threads have access to CNN and races might occur (not thread-safe).

I'm not 100% sure here; maybe some guru could explain it further.

DJMaze:
See:

--- Code: ---int main()
{
clock_t t = clock();
CNN cnn; // This makes it slow

cnn.A.SetSize(3,3);
cnn.B.SetSize(3,3);
cnn.SetSize(640,480);

cnn.A(0,0) = 0; cnn.A(0,1) = 0; cnn.A(0,2) = 0;
cnn.A(1,0) = 1; cnn.A(1,1) = 4; cnn.A(1,2) = 2;
cnn.A(2,0) = 0; cnn.A(2,1) = 0; cnn.A(2,2) = 0;

cnn.B(0,0) = 1; cnn.B(0,1) = 4; cnn.B(0,2) = 7;
cnn.B(1,0) = 2; cnn.B(1,1) = 5; cnn.B(1,2) = 8;
cnn.B(2,0) = 3; cnn.B(2,1) = 6; cnn.B(2,2) = 9;

cnn.Check_A();

cnn.maxiter = 100;
cnn.Process();
printf("%ld\n", (long)(clock() - t));

return 0;
}
--- End code ---

--- Code: ---int main()
{
clock_t t = clock();
CNN* cnn = new CNN(); // This also makes it fast

cnn->A.SetSize(3,3);
cnn->B.SetSize(3,3);
cnn->SetSize(640,480);

cnn->A(0,0) = 0; cnn->A(0,1) = 0; cnn->A(0,2) = 0;
cnn->A(1,0) = 1; cnn->A(1,1) = 4; cnn->A(1,2) = 2;
cnn->A(2,0) = 0; cnn->A(2,1) = 0; cnn->A(2,2) = 0;

cnn->B(0,0) = 1; cnn->B(0,1) = 4; cnn->B(0,2) = 7;
cnn->B(1,0) = 2; cnn->B(1,1) = 5; cnn->B(1,2) = 8;
cnn->B(2,0) = 3; cnn->B(2,1) = 6; cnn->B(2,2) = 9;

cnn->Check_A();

cnn->maxiter = 100;
cnn->Process();
printf("%ld\n", (long)(clock() - t));

delete cnn; // release the heap-allocated object
return 0;
}
--- End code ---
NOTE: I haven't tested this, so you may have to modify it a bit.

MortenMacFly:
I wonder how this is related to the Code::Blocks application or its development?
Don't get me wrong, but I guess if you want to have "experts" answering your question, you should post to a C++ board or to a newsgroup such as comp.lang.c++.
With regards, Morten.

thomas:

--- Quote ---However, accessing a stack object is slower than accessing a heap object inside a thread.
main() is the main threading object inside a process, and every stack object inside a thread is checked for access privileges on an NT machine (think about CreateThread()).
--- End quote ---
Are you even certain about that?
I could not imagine how this checking could be implemented, and it does not seem to make any sense to me either, since privileges are bound to the process.

In fact, use of stack objects is often encouraged not only because creation and destruction are faster by an order of magnitude, but also because cache locality (and thus access time) is a lot better (especially for smaller objects).
I would be really surprised if stack allocation proved to be that much slower; the only possible pitfall with it is data member alignment.
Operator new guarantees proper alignment for an object of the allocated type. The stack does not guarantee any alignment, but most compilers will align to something like sizeof(void*) nevertheless. So unless you use PODs larger than that (long long), access should be at least as fast, and sometimes faster.
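
If alignment of a local object does turn out to matter, a C++11 compiler lets you request it explicitly instead of relying on the default. This is a minimal sketch under that assumption: Cell, is_aligned and sum_aligned_local are hypothetical names for illustration and have nothing to do with the CNN class from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical payload type, not the CNN class from the thread.
struct Cell {
    double state;
    double input;
};

// True when the address p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Sums a local array whose alignment is requested with alignas,
// rather than left to whatever the stack frame happens to provide.
double sum_aligned_local() {
    alignas(16) Cell cells[4] = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}, {7.0, 8.0}};
    assert(is_aligned(cells, 16));
    double total = 0.0;
    for (const Cell& c : cells) {
        total += c.state + c.input;
    }
    return total;
}
```

The same is_aligned check can be pointed at a plain local object with alignof(T) as the second argument to see what the compiler gives you by default.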
