Page 1 of 2
Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 4:19 am
by scription
Hey everybody, my name's David and I'm new here... I've dabbled in PHP in the past - nothing serious.
I'm attempting to write a random-code generator that will return a 5-character code to a client (for example, '8xc55'). I'd like this code to be unique, so I'm planning on storing it in a MySQL database, along with some other related data. When the script is run, it should generate a new code, confer with the database to make sure that code doesn't already exist, and return the code. If the code does exist, the script loops until it's generated a unique code. (Note that the codes will get thrown out after a certain amount of time, so I'm not worried about using up all the possible codes.)
The problem I'm thinking I'll have is with concurrency. Say a bajillion clients access the script at the same time (I can dream, right?

). It's foreseeable that the same 5-character code could be generated by two different instances of the script running at the same time. For this reason, I'm assuming I will need a lock around the 5-character generation code. In pseudo-code, it would look something like:
Code: Select all
lock();
while (true)
{
newCode = generateNewCode();
if (codeDoesntAlreadyExist(newCode))
break;
}
writeCodeToDatabase(newCode);
unlock();
I've looked into flock() for the locking mechanism, but after reading some comments on it, it sounds like I should use something else. Anyone have any suggestions for locking mechanisms for the above pseudo-code?
Thanks a lot for any insight!
David
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 7:29 am
by sergio-pro
Hi
In Mysql Database - oriented pseudocode it will look like:
Code: Select all
$mysql->execute("START TRANSACTION");
while (!$found) {
$code = generateCode();
$rows = $mysql->select("SELECT COUNT(*) as cnt FROM mytable WHERE code='$code' FOR UPDATE;");
IF ($rows[0]['cnt'] == 0)
break;
}
INSERT INTO mytable (code) VALUES ($code);
$mysql->execute("COMMIT");
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 11:28 am
by pickle
It's reassuring to see someone relatively new realize that generating an arbitrarily large, random string does not guarantee uniqueness.
Transactions don't work in MyISAM tables, which is likely the format you've got. So, you'll need to do table locking. flock() is for actual files, not databases. You'll want to start with "LOCK whatever-your-table-name-is WRITE", do your code generation, then finish with "UNLOCK TABLES".
Also, look into a do...while() loop - which is guaranteed to execute at least once & is exactly what you want to use rather than while(true){...}
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 12:45 pm
by s.dot
pickle wrote:Also, look into a do...while() loop - which is guaranteed to execute at least once & is exactly what you want to use rather than while(true){...}
Yes.
There are a couple other precautions you could take if you are worried about a race condition happening..
Make the code field a unique index in mysql - same code can't be entered on two separate rows
Guarantee the code is unique! You could prefix your codes with time() (unique once per second) or uniqid() (ermm, i think this is unique all of the time?) but the code will be longer than 5 characters.
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 1:27 pm
by cristiano
If you're gonna have a billion clients looking up and creating a small random string, the checking overhead will prolly kill your server.
I think instead of checking the database if it has the unique string already, you should use an algorithm that will produce unique codes each time you run the script.
I would imagine I would prolly run an auto increment and then encode the auto increment with a custom made algorithm.
This is my thought.
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 1:53 pm
by Eran
cristiano has the right idea, though I wouldn't use a custom algorithm. Hashing function are exactly what you need - they generate unique strings for unique values. You can use for values something that is guranteed to be unique, such as the user ID and the result of the microtime() function. Something like:
Code: Select all
$unique = hash('sha256',microtime() . $userId); //$userId is the unique user identifier you should have in your application
This will return a 64 character unique string which you can trim to the size you need. Keep in mind though that less characters mean a higher chance of collision between values (since there are less options).
http://www.php.net/manual/en/function.hash.php
Re: Concurrency, Locks & MySQL, oh my
Posted: Thu Feb 12, 2009 2:04 pm
by cristiano
He's prolly looking for all this concurrency stuff because he's coding an URL shortener or something of that sort.
Instead of using SHA or any other one side encryption, its prolly easier to use a custom encoding function and then decoding it on your server itself to find the number it is referring to and then looking up that auto increment on the database would make it the fastest.
Re: Concurrency, Locks & MySQL, oh my
Posted: Fri Feb 13, 2009 4:03 pm
by pickle
He has to check the DB if he wants to guarantee uniqueness. No matter what your algorithm - whether it uses a random number or the current timestamp (even with microseconds), there is still a mathematical possibility that a duplicate ID will exist. The only way to guarantee uniqueness via your script is to have a record of what's already been generated.
He won't be doing tons of lookups I'd imagine - the probability of a duplicate ID is pretty low, so he'd only be checking once, maybe twice for each ID.
Re: Concurrency, Locks & MySQL, oh my
Posted: Fri Feb 13, 2009 7:16 pm
by VladSun
pytrin wrote:Hashing function are exactly what you need - they generate unique strings for unique values.
I can't agree with this one
http://en.wikipedia.org/wiki/Hash_function
A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Injective and perfect hashing
The ideal hashing function should be injective — that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search.
An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots.
Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values (e.g. for mapping data items from a large set of valid values into elements of a memory array of limited size) and therefore cannot be injective.
As I mentioned in my past post (sorry, josh) hash algorithms (the useful ones)
require the existence of collisions, so "they generate unique strings for unique values" is not true.
It's evident that if you have a set M(n) mapped to a set L(n) where the number of all possible values of M(n) set is larger then the number of all possible values of L(n) ser there will be always an existing "a" and "b" for which M(a) != M(b) while L(a) == L(b).
Re: Concurrency, Locks & MySQL, oh my
Posted: Fri Feb 13, 2009 7:29 pm
by Eran
Not sure what you are turning this into - it's not a theoretical debate on what hashing functions are (I think you already had this in another thread).
Strictly speaking - he wants to generate a unique string on demand for his clients. If he follows conventions, each client has a unique identifier (probably an auto-incrementing integer key). If you make the reasonable assumption that a client won't be asking for more than one unique string per micro-second, combining that identifier with the micro-timestamp should be enough to guarantee a unique signature. The hashing function merely takes that signature and produces a string that people like to see as unique strings - a combination of characters and numbers.
The better hashing functions (such as SHA2) have no known collisions to date. They offer 2^64 which is around 18.4 * 10^18 possibilities. I would say that would qualify as sufficient, especially when the strings that generate those unique signature have a smaller size. He can mix a database check to be absolutely sure, but I would leave transactions and locking out of it since the chance is so low that is not worth dealing with (what is called - statistically impossible).
Re: Concurrency, Locks & MySQL, oh my
Posted: Fri Feb 13, 2009 7:55 pm
by VladSun
I simply mentioned that "they generate unique strings for unique values" statement is not true because you need a perfect hashing function. And SHA2 (or any limited value hashing function) is not a prefect one. "Statistically impossible" doesn't mean "impossible".
"(such as SHA2) have no known collisions to date" does not prove there are no collisions.
I do agree with pickle's statement: "He has to check the DB if he wants to guarantee uniqueness. No matter what your algorithm - whether it uses a random number or the current timestamp (even with microseconds), there is still a mathematical possibility that a duplicate ID will exist. The only way to guarantee uniqueness via your script is to have a record of what's already been generated." even when hashing functions are involved.
Re: Concurrency, Locks & MySQL, oh my
Posted: Fri Feb 13, 2009 8:01 pm
by Eran
My original point was that he shouldn't use custom hashing functions as there are better solutions available. My second point is that locking / transactions is an overkill since the chance of collision is extremely low. I don't see anything in what you wrote that refutes those arguments. What exactly is your point? would you use transactions for solving this problem?
Re: Concurrency, Locks & MySQL, oh my
Posted: Sat Feb 14, 2009 11:35 am
by sergio-pro
IMO talking about hash algorithms does not make sense in this context.
If you need to generate unique string - you just generate GUID - and do not care of any collisions.
But the task is to generate FIVE SYMBOL code - so the best way I think - is custom algorithm - and database checks.
Re: Concurrency, Locks & MySQL, oh my
Posted: Sat Feb 14, 2009 1:49 pm
by VladSun
pytrin wrote:My original point was that he shouldn't use custom hashing functions as there are better solutions available. My second point is that locking / transactions is an overkill since the chance of collision is extremely low. I don't see anything in what you wrote that refutes those arguments. What exactly is your point? would you use transactions for solving this problem?
My point is, I don't think you are using the term "hashing function" properly.
I don't see where the OP has used any "custom hashing functions". The process of generating an unique string in the time domain is not hashing.
Re: Concurrency, Locks & MySQL, oh my
Posted: Sat Feb 14, 2009 1:56 pm
by Eran
This is why it's important to read the posts as part of a complete conversation. I was responding to this cristiano fellow who suggested using custom functions to generate strings from the auto-incrementing identifier and I simply suggested that there hashing functions might be better used instead of creating a custom function with an unknown (probably worse) collision matrix.
I wasn't trying to start a debate on the merits of hashing functions for such purposes versus their theoretical definition.