Concurrency, Locks & MySQL, oh my

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

pytrin wrote:This is why it's important to read the posts as part of a complete conversation. I was responding to this cristiano fellow who suggested using custom functions to generate strings from the auto-incrementing identifier and I simply suggested that there hashing functions might be better used instead of creating a custom function with an unknown (probably worse) collision matrix.

I wasn't trying to start a debate on the merits of hashing functions for such purposes versus their theoretical definition.
:)
I did follow the debate and nobody except you has mentioned hashing functions ;)

Also, I still can't agree "a standard hash function" is better to use than a custom one.
First, because it has collisions for sure (and indeed, for new hashing functions it's an "unknown collision matrix").
Second, a 1:1 function is not "a custom function with an unknown (probably worse) collision matrix" - simply, it has no collisions.

So, applying a 1:1 custom defined function to an autoincrement value is much, much better than using a hashing function.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

I suppose you believe most developers have the capability of creating a 1:1 function that returns a fixed-size string with no collisions. I most certainly don't (and it's not possible. you can only minimize collisions for fixed size string return values, which is what tried and tested hashing functions do).
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

Let me say it in other words:
I think you meant encoding (which is reversible and so, it's a 1:1 function) and not hashing. Do you agree :)
While it's known that most of the encoding functions use padding when data block is shorter, I do agree (in general) that generating a fixed length encoded value from a variable length input value is a hard task to do :)

Agree? :)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

I think it is you who misunderstands the terms - you are talking about cryptographic hash functions, in which reversibility and dis-continuity are important attributes. It's true that the commonly used hashing functions (md5, sha1) are such functions. However, it's not a prerequisite of a general hashing function - it can be reversible and still be considered a hash. I invite you read in full the wikipedia page you linked earlier and read about the different types of hashing functions mentioned there.

Encoding is something else - it does not return a fixed-size string, and therefor is not vulnerable to collisions. Real 1:1 functions would not be of help for creating fixed-size strings from unknown input unless the complete range is known ahead of time.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

pytrin wrote:I think it is you who misunderstands the terms - you are talking about cryptographic hash functions, in which reversibility and dis-continuity are important attributes. It's true that the commonly used hashing functions (md5, sha1) are such functions. However, it's not a prerequisite of a general hashing function - it can be reversible and still be considered a hash. I invite you read in full the wikipedia page you linked earlier and read about the different types of hashing functions mentioned there.
VladSun wrote:
pytrin wrote:Hashing function are exactly what you need - they generate unique strings for unique values.
I can't agree with this one :(

http://en.wikipedia.org/wiki/Hash_function
A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Injective and perfect hashing

The ideal hashing function should be injective — that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search.

An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots.

Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values (e.g. for mapping data items from a large set of valid values into elements of a memory array of limited size) and therefore cannot be injective.
As I mentioned in my past post (sorry, josh) hash algorithms (the useful ones) require the existence of collisions, so "they generate unique strings for unique values" is not true.

It's evident that if you have a set M(n) mapped to a set L(n) where the number of all possible values of M(n) set is larger then the number of all possible values of L(n) ser there will be always an existing "a" and "b" for which M(a) != M(b) while L(a) == L(b).
...
And you were proposing the usage of SHA-* ...



pytrin wrote:Encoding is something else - it does not return a fixed-size string, and therefor is not vulnerable to collisions. Real 1:1 functions would not be of help for creating fixed-size strings from unknown input unless the complete range is known ahead of time.
Let's take a 56 bit key algorithm - (e.g. DES), so a 56 bit autoincrement value will be 1:1 mapped in a fixed length output. Voila!
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

It seems you always refer to only a part of what I write without taking everything in context.
Real 1:1 functions would not be of help for creating fixed-size strings from unknown input unless the complete range is known ahead of time.
I said that if the range is known ahead of time then it's possible. The OP said he wants 5-character strings, so there will be collisions if the auto-incrementing key every reaches a certain threashold, especially if he wants to generate multiple such strings for each client. Whether or not that's relevant it's hard to tell, but it's the same both for a 64-bit hash (Sha256) and a 56-bit encoding.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

Did I say you were wrong about the last one ;)

The one that started the discussion between you and me here was the "Hashing function are exactly what you need - they generate unique strings for unique values." statement of yours.
A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum...
That's the behaviour everybody assumes when he/her sees "hashing". I mentioned the "perfect hashing" (though it's not fixed length of course) one as an exception of this rule, but it's not the case with any of MD5, SHA-*, etc. family hashing functions.
There are 10 types of people in this world, those who understand binary and those who don't
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Concurrency, Locks & MySQL, oh my

Post by josh »

VladSun wrote:As I mentioned in my past post (sorry, josh) hash algorithms (the useful ones) require the existence of collisions, so "they generate unique strings for unique values" is not true.
I don't see where you're getting that definition. So in low level programming languages a requirement is that hash maps have collisions?
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

In low level programming languages hashing is mainly used for building fast search algorithms. It's needed to map an unknown range of values to a smaller, fixed one range. So if we have an a sequence of strings, they are transformed into a sequence of hashes. The optimal way to do this is to have N variable length strings mapped into a successive sequence of N unique log2(N)-bit-length hashes. This way one will map these strings into the system memory and use the hash values as addresses to the elements of this "memory array", thus when the presence of a given string value have to be checked it will be done in a very efficient and fast way - just by checking whether the hashing function value exists in the memory address range (a look up table). While having this ideal hashing function is very hard (I think it's impossible) to build, there are some requirements one should be aware of when building such a mapping:
1) the hash function has to generate values with the most minimal possible length - that's because the memory allocated has to cover min(hash) to max(hash) addresses. And computer memory is finite;
2) the hash function has to generate the most minimal possible number of collisions - that's because every collision requires a call to a function to solve this collision (i.e. to put this string hash value to another address, because the current one already corresponds to an existing string value)

These two requirements are in conflict with each other and this conflict is solved by applying a function to solve the collisions, using a rehashing methods and changing the hash function length adaptively, in real time.

The fact that it's required that the set of all possible input values M is bigger than the set of all possible hash values N is enough to say that one should always expect collisions. The requirement 1) insists that you can't use N>M hashing function because it will be useless. So having the possibility of collisions become a requirement.

The words hashing and collision are rarely to be seen apart :)
There are 10 types of people in this world, those who understand binary and those who don't
Post Reply