Author Topic: patch : from std::map replace to hashmap  (Read 52109 times)

Offline Loaden

  • Lives here!
  • ****
  • Posts: 1014
Re: patch : from std::map replace to hashmap
« Reply #30 on: March 22, 2010, 05:18:25 pm »
Hi, I write a new wxStringHashMap, It's use random key, like this:
Code
#include <iostream>
#include <map>
#include <vector>
#include <stdlib.h>
#include <conio.h>
#include <time.h>

#include <wx/string.h>
#include <wx/timer.h>
#include <wx/hashmap.h>

const size_t LOOPS = 9;
const size_t TIMES = 10000000;
const wxString FINDKEY = _T("_GLIBCXX_END_NAMESPACE_TR1");

class HashForWxStringMap
{
public:
    HashForWxStringMap() {}
    unsigned long operator()(const wxString& x) const
    {
        size_t len = x.length();
        return size_t((128 ^ len) + *((unsigned int*)(x.c_str() + len - 4)));
    }
    HashForWxStringMap& operator=(const HashForWxStringMap&) { return *this; }
};

class EqualForWxStringMap
{
public:
    EqualForWxStringMap() { }
    bool operator()(const wxString& a, const wxString& b) const
    { return a == b; }

    EqualForWxStringMap& operator=(const EqualForWxStringMap&) { return *this; }
};

WX_DECLARE_HASH_MAP(wxString, wxString, HashForWxStringMap, EqualForWxStringMap, HashForWxStringMapMap);
//WX_DECLARE_STRING_HASH_MAP(wxString, HashForWxStringMapMap);

typedef std::vector<wxString> RandomString;
const wxChar CCH[] = _T("_0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_");
const int RANDOM_NUM = 1000000;

void InitRandomString(RandomString& rs)
{
    for (int i = 0; i < RANDOM_NUM + 1; ++i)
    {
        int len = 1 + rand() % (sizeof(CCH) / sizeof(wxChar) - 1);
        wxChar* buf = new wxChar[len + 1];
        for (int j = 0; j < len; ++j)
        {
            int x = rand() % (sizeof(CCH) / sizeof(wxChar) - 1);
            buf[j] = CCH[x];
        }
        buf[len] = 0;
//        std::wcout << buf << std::endl;
        rs.push_back(buf);
        delete[] buf;
    }
}

int main()
{
    srand((unsigned)time(NULL));
    std::cout << "Please wait for random string..." << std::endl;
    std::cout << std::endl;
   
    RandomString rs;
    InitRandomString(rs);
   
    std::map<wxString, wxString> stdMap;
    stdMap[_T("_GLIBCXX_STD")] = _T("std");
    stdMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE")] = _T("+namespace");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE")] = _T("+namespace");
    stdMap[_T("_GLIBCXX_END_NESTED_NAMESPACE")] = _T("}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE")] = _T("}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE_TR1")] = _T("}");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR1")] = _T("-namespace tr1 {");
    stdMap[_T("_STD_BEGIN")] = _T("-namespace std {");
    stdMap[_T("_STD_END")] = _T("}");
    stdMap[_T("_GLIBCXX_STD2")] = _T("2std");
    stdMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE2")] = _T("2+namespace");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE2")] = _T("2+namespace");
    stdMap[_T("_GLIBCXX_END_NESTED_NAMESPACE2")] = _T("2}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE2")] = _T("2}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE_TR12")] = _T("2}");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR12")] = _T("2-namespace tr1 {");
    stdMap[_T("_STD_BEGIN2")] = _T("2-namespace std {");
    stdMap[_T("_STD_END2")] = _T("2}");

    HashForWxStringMapMap hashMap;
    hashMap[_T("_GLIBCXX_STD")] = _T("std");
    hashMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE")] = _T("+namespace");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE")] = _T("+namespace");
    hashMap[_T("_GLIBCXX_END_NESTED_NAMESPACE")] = _T("}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE")] = _T("}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE_TR1")] = _T("}");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR1")] = _T("-namespace tr1 {");
    hashMap[_T("_STD_BEGIN")] = _T("-namespace std {");
    hashMap[_T("_STD_END")] = _T("}");
    hashMap[_T("_GLIBCXX_STD2")] = _T("2std");
    hashMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE2")] = _T("2+namespace");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE2")] = _T("2+namespace");
    hashMap[_T("_GLIBCXX_END_NESTED_NAMESPACE2")] = _T("2}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE2")] = _T("2}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE_TR12")] = _T("2}");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR12")] = _T("2-namespace tr1 {");
    hashMap[_T("_STD_BEGIN2")] = _T("2-namespace std {");
    hashMap[_T("_STD_END2")] = _T("2}");

    int stdTotalTime = 0;
    for (size_t i = 0; i < LOOPS; ++i)
    {
        wxStopWatch sw;
        for (size_t j = 0; j < TIMES; ++j) stdMap.find(rs.at(i % RANDOM_NUM));
        sw.Pause();
        int time = sw.Time();
        std::cout << "stdMap " << i + 1 << " : " << time << std::endl;
        stdTotalTime += time;
    }
    std::cout << std::endl;
    std::cout << "stdMap Total Time: " << stdTotalTime << std::endl;
    double stdAverage = stdTotalTime / LOOPS;
    std::cout << "stdMap Average Time: " << stdAverage << std::endl;
    std::cout << "------------------------------\n" << std::endl;

    int hashTotalTime = 0;
    for (size_t i = 0; i < LOOPS; ++i)
    {
        wxStopWatch sw;
        for (size_t j = 0; j < TIMES; ++j) hashMap.find(rs.at(i % RANDOM_NUM));
        sw.Pause();
        int time = sw.Time();
        std::cout << "hashMap " << i + 1 << " : " << time << std::endl;
        hashTotalTime += time;
    }
    std::cout << std::endl;
    std::cout << "hashMap Total Time: " << hashTotalTime << std::endl;
    double hashAverage = hashTotalTime / LOOPS;
    std::cout << "hashMap Average Time: " << hashAverage << std::endl;
    std::cout << "------------------------------\n" << std::endl;

    std::cout << "(stdAverage - hashAverage) / stdAverage = " << (stdAverage - hashAverage) / stdAverage;
    std::cout << std::endl;

    return 0;
}
Code
Please wait for random string...

stdMap 1 : 1265
stdMap 2 : 1266
stdMap 3 : 984
stdMap 4 : 1281
stdMap 5 : 1016
stdMap 6 : 1250
stdMap 7 : 828
stdMap 8 : 1234
stdMap 9 : 1125

stdMap Total Time: 10249
stdMap Average Time: 1138
------------------------------

hashMap 1 : 250
hashMap 2 : 250
hashMap 3 : 297
hashMap 4 : 250
hashMap 5 : 250
hashMap 6 : 250
hashMap 7 : 250
hashMap 8 : 266
hashMap 9 : 250

hashMap Total Time: 2313
hashMap Average Time: 257
------------------------------

(stdAverage - hashAverage) / stdAverage = 0.774165

Process returned 0 (0x0)   execution time : 16.000 s
Press any key to continue.

Offline Loaden

  • Lives here!
  • ****
  • Posts: 1014
Re: patch : from std::map replace to hashmap
« Reply #31 on: March 22, 2010, 05:24:01 pm »
If change the hash function TO:
Code
class HashForWxStringMap
{
public:
    HashForWxStringMap() {}
    unsigned long operator()(const wxString& x) const
    {
        size_t len = x.length();
        if (len >= sizeof(unsigned int) / sizeof(wxChar))
            return size_t((128 ^ len) + *((unsigned int*)(x.c_str() + len - sizeof(unsigned int) / sizeof(wxChar))));
        else if (len >= sizeof(unsigned short) / sizeof(wxChar))
            return size_t((256 ^ len) + *((unsigned short*)(x.c_str() + len - sizeof(unsigned short) / sizeof(wxChar))));
        else
            return size_t((512 ^ len) + *(x.c_str() + len - sizeof(unsigned short)));
    }
    HashForWxStringMap& operator=(const HashForWxStringMap&) { return *this; }
};

AND ONLY 9 replacement,  report here:
Code
Please wait for random string...

stdMap 1 : 609
stdMap 2 : 610
stdMap 3 : 562
stdMap 4 : 1063
stdMap 5 : 562
stdMap 6 : 610
stdMap 7 : 1047
stdMap 8 : 765
stdMap 9 : 610

stdMap Total Time: 6438
stdMap Average Time: 715
------------------------------

hashMap 1 : 250
hashMap 2 : 328
hashMap 3 : 250
hashMap 4 : 250
hashMap 5 : 250
hashMap 6 : 234
hashMap 7 : 250
hashMap 8 : 250
hashMap 9 : 250

hashMap Total Time: 2312
hashMap Average Time: 256
------------------------------

(stdAverage - hashAverage) / stdAverage = 0.641958
« Last Edit: March 22, 2010, 05:40:10 pm by Loaden »

Offline Loaden

  • Lives here!
  • ****
  • Posts: 1014
Re: patch : from std::map replace to hashmap
« Reply #32 on: March 22, 2010, 06:05:58 pm »
Here is final report:
Code
Please wait for random string...

stdMap 1 : 1032
stdMap 2 : 766
stdMap 3 : 609
stdMap 4 : 766
stdMap 5 : 578
stdMap 6 : 609
stdMap 7 : 1032
stdMap 8 : 1015
stdMap 9 : 609

stdMap Total Time: 7016
stdMap Average Time: 779
------------------------------

hashMap 1 : 250
hashMap 2 : 250
hashMap 3 : 250
hashMap 4 : 250
hashMap 5 : 250
hashMap 6 : 250
hashMap 7 : 250
hashMap 8 : 235
hashMap 9 : 250

hashMap Total Time: 2235
hashMap Average Time: 248
------------------------------

(stdAverage - hashAverage) / stdAverage = 0.681643

Process returned 0 (0x0)   execution time : 12.703 s
Press any key to continue.

DEMO (-O2):
Code
#include <iostream>
#include <map>
#include <vector>
#include <stdlib.h>
#include <conio.h>
#include <time.h>

#include <wx/string.h>
#include <wx/timer.h>
#include <wx/hashmap.h>

const size_t LOOPS = 9;
const size_t TIMES = 10000000;
const wxChar CCH[] = _T("_0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_");
const int RANDOM_NUM = 1000000;
typedef std::vector<wxString> RandomString;

class HashForWxStringMap
{
public:
    HashForWxStringMap() {}
    unsigned long operator()(const wxString& x) const
    {
        const size_t len = x.length();
        const size_t intWxChar = sizeof(unsigned int) / sizeof(wxChar);
        const size_t shortWxChar = sizeof(unsigned short) / sizeof(wxChar);
        if (len >= intWxChar)
            return size_t((128 ^ len) + *((unsigned int*)(x.c_str() + len - intWxChar)));
        else if (len >= shortWxChar)
            return size_t((256 ^ len) + *((unsigned short*)(x.c_str() + len - shortWxChar)));
        else
            return size_t((512 ^ len) + *(x.c_str() + len - 1));
    }
    HashForWxStringMap& operator=(const HashForWxStringMap&) { return *this; }
};

class EqualForWxStringMap
{
public:
    EqualForWxStringMap() { }
    bool operator()(const wxString& a, const wxString& b) const
    { return a == b; }

    EqualForWxStringMap& operator=(const EqualForWxStringMap&) { return *this; }
};

WX_DECLARE_HASH_MAP(wxString, wxString, HashForWxStringMap, EqualForWxStringMap, wxStringHashMap);

void InitRandomString(RandomString& rs)
{
    for (int i = 0; i < RANDOM_NUM + 1; ++i)
    {
        int len = 1 + rand() % (sizeof(CCH) / sizeof(wxChar) - 1);
        wxChar* buf = new wxChar[len + 1];
        for (int j = 0; j < len; ++j)
        {
            int x = rand() % (sizeof(CCH) / sizeof(wxChar) - 1);
            buf[j] = CCH[x];
        }
        buf[len] = 0;
        rs.push_back(buf);
        delete[] buf;
    }
}

int main()
{
    srand((unsigned)time(NULL));
    std::cout << "Please wait for random string..." << std::endl;
    std::cout << std::endl;
   
    RandomString rs;
    InitRandomString(rs);
   
    std::map<wxString, wxString> stdMap;
    stdMap[_T("_GLIBCXX_STD")] = _T("std");
    stdMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE")] = _T("+namespace");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE")] = _T("+namespace");
    stdMap[_T("_GLIBCXX_END_NESTED_NAMESPACE")] = _T("}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE")] = _T("}");
    stdMap[_T("_GLIBCXX_END_NAMESPACE_TR1")] = _T("}");
    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR1")] = _T("-namespace tr1 {");
    stdMap[_T("_STD_BEGIN")] = _T("-namespace std {");
    stdMap[_T("_STD_END")] = _T("}");
//    stdMap[_T("_GLIBCXX_STD2")] = _T("2std");
//    stdMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE2")] = _T("2+namespace");
//    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE2")] = _T("2+namespace");
//    stdMap[_T("_GLIBCXX_END_NESTED_NAMESPACE2")] = _T("2}");
//    stdMap[_T("_GLIBCXX_END_NAMESPACE2")] = _T("2}");
//    stdMap[_T("_GLIBCXX_END_NAMESPACE_TR12")] = _T("2}");
//    stdMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR12")] = _T("2-namespace tr1 {");
//    stdMap[_T("_STD_BEGIN2")] = _T("2-namespace std {");
//    stdMap[_T("_STD_END2")] = _T("2}");

    wxStringHashMap hashMap;
    hashMap[_T("_GLIBCXX_STD")] = _T("std");
    hashMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE")] = _T("+namespace");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE")] = _T("+namespace");
    hashMap[_T("_GLIBCXX_END_NESTED_NAMESPACE")] = _T("}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE")] = _T("}");
    hashMap[_T("_GLIBCXX_END_NAMESPACE_TR1")] = _T("}");
    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR1")] = _T("-namespace tr1 {");
    hashMap[_T("_STD_BEGIN")] = _T("-namespace std {");
    hashMap[_T("_STD_END")] = _T("}");
//    hashMap[_T("_GLIBCXX_STD2")] = _T("2std");
//    hashMap[_T("_GLIBCXX_BEGIN_NESTED_NAMESPACE2")] = _T("2+namespace");
//    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE2")] = _T("2+namespace");
//    hashMap[_T("_GLIBCXX_END_NESTED_NAMESPACE2")] = _T("2}");
//    hashMap[_T("_GLIBCXX_END_NAMESPACE2")] = _T("2}");
//    hashMap[_T("_GLIBCXX_END_NAMESPACE_TR12")] = _T("2}");
//    hashMap[_T("_GLIBCXX_BEGIN_NAMESPACE_TR12")] = _T("2-namespace tr1 {");
//    hashMap[_T("_STD_BEGIN2")] = _T("2-namespace std {");
//    hashMap[_T("_STD_END2")] = _T("2}");

    int stdTotalTime = 0;
    for (size_t i = 0; i < LOOPS; ++i)
    {
        wxStopWatch sw;
        for (size_t j = 0; j < TIMES; ++j) stdMap.find(rs.at(i % RANDOM_NUM));
        sw.Pause();
        int time = sw.Time();
        std::cout << "stdMap " << i + 1 << " : " << time << std::endl;
        stdTotalTime += time;
    }
    std::cout << std::endl;
    std::cout << "stdMap Total Time: " << stdTotalTime << std::endl;
    double stdAverage = stdTotalTime / LOOPS;
    std::cout << "stdMap Average Time: " << stdAverage << std::endl;
    std::cout << "------------------------------\n" << std::endl;

    int hashTotalTime = 0;
    for (size_t i = 0; i < LOOPS; ++i)
    {
        wxStopWatch sw;
        for (size_t j = 0; j < TIMES; ++j) hashMap.find(rs.at(i % RANDOM_NUM));
        sw.Pause();
        int time = sw.Time();
        std::cout << "hashMap " << i + 1 << " : " << time << std::endl;
        hashTotalTime += time;
    }
    std::cout << std::endl;
    std::cout << "hashMap Total Time: " << hashTotalTime << std::endl;
    double hashAverage = hashTotalTime / LOOPS;
    std::cout << "hashMap Average Time: " << hashAverage << std::endl;
    std::cout << "------------------------------\n" << std::endl;

    std::cout << "(stdAverage - hashAverage) / stdAverage = " << (stdAverage - hashAverage) / stdAverage;
    std::cout << std::endl;

    return 0;
}

[attachment deleted by admin]

Offline koso

  • Multiple posting newcomer
  • *
  • Posts: 58
Re: patch : from std::map replace to hashmap
« Reply #33 on: March 22, 2010, 06:10:04 pm »
@Loaden: you made the hash function quiet complicated, but i dont understand, what were you trying to do. Are you aware of sizeof(int)=sizeof(wxChar)=4 .. on 32 bit system with UNICODE wxWidgets? At this configuration, your hash function only takes last character of string.

Offline Loaden

  • Lives here!
  • ****
  • Posts: 1014
Re: patch : from std::map replace to hashmap
« Reply #34 on: March 22, 2010, 06:24:57 pm »
@Loaden: you made the hash function quiet complicated, but i dont understand, what were you trying to do. Are you aware of sizeof(int)=sizeof(wxChar)=4 .. on 32 bit system with UNICODE wxWidgets? At this configuration, your hash function only takes last character of string.
On 32 bit windows OS, sizeof(wxChar) == 2, but sizeof(int) = 4;
Yes, I only only judge the length and the last one (or two) character, Because has enough for a replacement rules.
You can look at the rules of these replacement:
Code
<TOKEN_REPLACEMENTS>
<ssmap>
<QT_BEGIN_HEADER>
<![CDATA[]]>
</QT_BEGIN_HEADER>
<QT_END_HEADER>
<![CDATA[]]>
</QT_END_HEADER>
<Q_GADGET>
<![CDATA[]]>
</Q_GADGET>
<Q_INLINE_TEMPLATE>
<![CDATA[]]>
</Q_INLINE_TEMPLATE>
<Q_OBJECT>
<![CDATA[]]>
</Q_OBJECT>
<Q_OUTOFLINE_TEMPLATE>
<![CDATA[]]>
</Q_OUTOFLINE_TEMPLATE>
<Q_PACKED>
<![CDATA[]]>
</Q_PACKED>
<Q_REQUIRED_RESULT>
<![CDATA[]]>
</Q_REQUIRED_RESULT>
<_GLIBCXX_BEGIN_NAMESPACE>
<![CDATA[+namespace]]>
</_GLIBCXX_BEGIN_NAMESPACE>
<_GLIBCXX_BEGIN_NAMESPACE_TR1>
<![CDATA[-namespace tr1 {]]>
</_GLIBCXX_BEGIN_NAMESPACE_TR1>
<_GLIBCXX_BEGIN_NESTED_NAMESPACE>
<![CDATA[+namespace]]>
</_GLIBCXX_BEGIN_NESTED_NAMESPACE>
<_GLIBCXX_END_NAMESPACE>
<![CDATA[}]]>
</_GLIBCXX_END_NAMESPACE>
<_GLIBCXX_END_NAMESPACE_TR1>
<![CDATA[}]]>
</_GLIBCXX_END_NAMESPACE_TR1>
<_GLIBCXX_END_NESTED_NAMESPACE>
<![CDATA[}]]>
</_GLIBCXX_END_NESTED_NAMESPACE>
<_GLIBCXX_STD>
<![CDATA[std]]>
</_GLIBCXX_STD>
<_STD_BEGIN>
<![CDATA[-namespace std {]]>
</_STD_BEGIN>
<_STD_END>
<![CDATA[}]]>
</_STD_END>
<emit>
<![CDATA[]]>
</emit>
</ssmap>
</TOKEN_REPLACEMENTS>

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: patch : from std::map replace to hashmap
« Reply #35 on: March 23, 2010, 12:18:54 pm »
I think you're thinking too complicated :)

Don't think in wxChars. All that Unicode shit only makes people's heads spin and renders code unnecessarily complex. Think in bytes. Sometimes you have to think in wxChars, but don't do it when it is not necessary. After all, wxChars are only bytes. On some systems, wxChars are several bytes, and on some they are just one byte. To a hash function it does not really matter, you can just treat it as a BLOB of size str.length() * sizeof(wxChar) bytes.

Personally, I wouldn't make it much more complicated than something like that:
unsigned long djb2mod(const unsigned char *str, unsigned int len)
{
   unsigned long hash = 9737333;

   while(len--)
      hash = 33 * hash + *str++;

   return hash;
}

class HashForWxStringMap
{
...
unsigned long operator()(const wxString& x) const { return ::djb2mod((const unsigned char*) x.data(), x.length()*sizeof(wxChar); }
};


Did not try whether what I wrote there compiles at all, hopefully there aren't too many severe mistakes in it :)
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline koso

  • Multiple posting newcomer
  • *
  • Posts: 58
Re: patch : from std::map replace to hashmap
« Reply #36 on: March 23, 2010, 01:10:06 pm »
I alsa think he complicates it too much :) There is no need to make it so complex. We have seen, that even lenght and one character are enought effective. I suggest to use only fast arithmetic operations (XOR, bit rotations).

About using bytes, I dont think it is good idea becase when strings are made of ANSI characters (in programming probably 99,9999%) then there are many empty bytes. And also it is much more effective working with native data types (32 bit system -> 32 bit integer), than with bytes. So ideal will be to convert/retype wxChar to (unsigned long) and make some arithmetic operations -> and that was used in previous versions, and also in Loadens, but there is it little lost :)

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: patch : from std::map replace to hashmap
« Reply #37 on: March 27, 2010, 03:12:17 pm »
About using bytes, I dont think it is good idea becase when strings are made of ANSI characters (in programming probably 99,9999%) then there are many empty bytes.
I beg to differ :)
Bytes give more iterations in the hash function which is a good thing, as it reduces the likelihood of collisions due to imperfect mixing. There will obviously still be collisions due to the laws of information theory (mapping many to few must have collisions), but those cannot be avoided.
The computional difference is a few hundred or thousand cycles for reasonably long words with a reasonably efficient hash function, which is comparable with the extra cache misses you'll get from collisions due to a less mixed hash function, or comparable with one heap allocation, which we implicitely do many million times during parsing.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline koso

  • Multiple posting newcomer
  • *
  • Posts: 58
Re: patch : from std::map replace to hashmap
« Reply #38 on: March 27, 2010, 03:51:05 pm »
Let f be our hash function, which maps all mapped strings to a different value. When searching for some random string s, there are two possible situations:

1. f(s) is found in table (or success or collision). Searching procedure must compare mapped string with string we are trying to find.
2. f(s) is not in table -> we instanly know, that string is not mapped 

You are right, that cryptographical-strong hash function function will have less collision, but contrary to that, function will be much more complex. From my example can be clearly seen, that penalty for collision is one extra string comparission. Once comparission is faster than strong hash function, there will be no reason to use it, becase it will be slower, even where there will be much less collisions.

I havent tested it, bot my opinion is, that your hash function is more complex than string comparission. Your function has more cycles, and uses less effective operations (multiplication). Also string comparission is very fast once strings are random, becase procedure ends at first character they differ.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: patch : from std::map replace to hashmap
« Reply #39 on: March 27, 2010, 10:31:12 pm »
1. f(s) is found in table (or success or collision). Searching procedure must compare mapped string with string we are trying to find.
2. f(s) is not in table -> we instanly know, that string is not mapped
2. is correct, but 1. is not. If you have collisions, there is either a linked list in the respective slot or some other scheme (linear/quadratic probing) is used to place values in different slots.
In either case, this is guaranteed to take more than one string comparison and more than one cache miss on the average. A string compare is a triple dereference, which usually means 3 cache misses if you have enough different strings. Three cache misses are approximately 1500 cpu cycles gone, and this isn't event taking the actual compare in account. You can run a lot of hash function in 1500 cycles :)

Quote
I havent tested it, bot my opinion is, that your hash function is more complex than string comparission. Your function has more cycles, and uses less effective operations (multiplication).
This function (which is a modification of a funciton written by D.J. Bernstein btw, not me -- I only replaced Wilson's 9th primeth recurrence with the 12th recurrence to flip more bits per step) is not as inefficient as you think, as the compiler will transform the multiplication to a shift and add. Some implementations "optimize" the source code to contain a shift directly, but I prefer to write out he multiplication, which is much cleaner code and exactly the same binary.
It is also not cryptographically secure, but that is not the function's intent. The funciton's intent is to hash arbitrary data nicely with few cpu cycles, and it does that quite well.
« Last Edit: March 27, 2010, 10:34:58 pm by thomas »
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline koso

  • Multiple posting newcomer
  • *
  • Posts: 58
Re: patch : from std::map replace to hashmap
« Reply #40 on: March 27, 2010, 11:11:36 pm »
2. is correct, but 1. is not. If you have collisions, there is either a linked list in the respective slot or some other scheme (linear/quadratic probing) is used to place values in different slots.
In either case, this is guaranteed to take more than one string comparison and more than one cache miss on the average. A string compare is a triple dereference, which usually means 3 cache misses if you have enough different strings. Three cache misses are approximately 1500 cpu cycles gone, and this isn't event taking the actual compare in account. You can run a lot of hash function in 1500 cycles :)

You have probably missed my assumption that mapped strings (strings stored in hash table) have all different hash values. There are most probably linked lists (called tokens), but all with only one string. This assumption is broken once user adds some custom replacing rules with "not good" strings.

And about cache I must confess I don't understand what you think with "cache" ... it is something related to wxString? I have not much  experience with wx programming, only with standard "direct memory" accessible c/c++ strings ...

And you are right, that multiplication with 33 is not as complex, as i wrote. Now I realized that 33 has hamming weight only 2. Your hash function usable as a general purpose hash function for wxString, but I still thinks It can be more effective, once hash function is optimized to C::B needs (i.e. use knowledge, that mapped string are mostly uppercase, and searched string are mostly lowercase, and at least in my case, they are much shorter).

PS: but even quiet bad hash function seems to by much faster, than used <map>. Used hash function is only detail which can be tweaked later. It will be even easier with testing on real projects, than using synthetic tests and theoretical assumptions ;)