Developer forums (C::B DEVELOPMENT STRICTLY!) > CodeCompletion redesign

patch: replace std::map with a hash map


thomas:
I think you're overcomplicating this :)

Don't think in wxChars. All that Unicode shit only makes people's heads spin and renders code unnecessarily complex. Think in bytes. Sometimes you have to think in wxChars, but don't do it when it is not necessary. After all, wxChars are only bytes. On some systems, wxChars are several bytes, and on some they are just one byte. To a hash function it does not really matter, you can just treat it as a BLOB of size str.length() * sizeof(wxChar) bytes.

Personally, I wouldn't make it much more complicated than something like that:
unsigned long djb2mod(const unsigned char *str, unsigned int len)
{
   unsigned long hash = 9737333;

   while(len--)
      hash = 33 * hash + *str++;

   return hash;
}

class HashForWxStringMap
{
...
unsigned long operator()(const wxString& x) const { return ::djb2mod((const unsigned char*) x.data(), x.length()*sizeof(wxChar)); }
};

Did not try whether what I wrote there compiles at all, hopefully there aren't too many severe mistakes in it :)

koso:
I also think he complicates it too much :) There is no need to make it so complex. We have seen that even the length and one character are effective enough. I suggest using only fast arithmetic operations (XOR, bit rotations).

About using bytes, I don't think it is a good idea, because when strings are made of ANSI characters (in programming, probably 99.9999%), there are many empty bytes. It is also much more efficient to work with native data types (32-bit system -> 32-bit integer) than with bytes. So the ideal would be to convert/retype wxChar to (unsigned long) and do some arithmetic operations -> that was used in previous versions, and also in Loaden's, but there it is a little lost :)

thomas:

--- Quote from: koso on March 23, 2010, 01:10:06 pm ---About using bytes, I don't think it is a good idea, because when strings are made of ANSI characters (in programming, probably 99.9999%), there are many empty bytes.
--- End quote ---
I beg to differ :)
Bytes give more iterations in the hash function, which is a good thing, as it reduces the likelihood of collisions due to imperfect mixing. There will obviously still be collisions due to the pigeonhole principle (mapping many values to few must produce collisions), but those cannot be avoided.
The computational difference is a few hundred or thousand cycles for reasonably long words with a reasonably efficient hash function, which is comparable to the extra cache misses you'll get from collisions due to a less well-mixed hash function, or to one heap allocation, which we implicitly perform many million times during parsing.

koso:
Let f be our hash function, which maps all stored strings to distinct values. When searching for some random string s, there are two possible situations:

1. f(s) is found in the table (either a hit or a collision). The search procedure must compare the mapped string with the string we are looking for.
2. f(s) is not in the table -> we instantly know that the string is not mapped

You are right that a cryptographically strong hash function will have fewer collisions, but on the other hand, such a function will be much more complex. From my example it can clearly be seen that the penalty for a collision is one extra string comparison. Once a comparison is faster than the strong hash function, there is no reason to use it, because it will be slower even where there are far fewer collisions.

I haven't tested it, but my opinion is that your hash function is more complex than a string comparison. Your function has more cycles and uses less efficient operations (multiplication). Also, string comparison is very fast once strings are random, because the procedure ends at the first character where they differ.

thomas:

--- Quote from: koso on March 27, 2010, 03:51:05 pm ---1. f(s) is found in the table (either a hit or a collision). The search procedure must compare the mapped string with the string we are looking for.
2. f(s) is not in the table -> we instantly know that the string is not mapped
--- End quote ---
2. is correct, but 1. is not. If you have collisions, there is either a linked list in the respective slot, or some other scheme (linear/quadratic probing) is used to place values in different slots.
In either case, this is guaranteed to take more than one string comparison and more than one cache miss on average. A string compare is a triple dereference, which usually means 3 cache misses if you have enough different strings. Three cache misses are approximately 1500 CPU cycles gone, and this isn't even taking the actual compare into account. You can run a lot of hash functions in 1500 cycles :)


--- Quote ---I havent tested it, bot my opinion is, that your hash function is more complex than string comparission. Your function has more cycles, and uses less effective operations (multiplication).
--- End quote ---
This function (which is a modification of a function written by D.J. Bernstein, by the way, not me -- I only replaced Wilson's 9th primeth recurrence with the 12th to flip more bits per step) is not as inefficient as you think, as the compiler will transform the multiplication into a shift and an add. Some implementations "optimize" the source code to contain a shift directly, but I prefer to write out the multiplication, which is much cleaner code and exactly the same binary.
It is also not cryptographically secure, but that is not the function's intent. The function's intent is to hash arbitrary data nicely with few CPU cycles, and it does that quite well.
