Concurrency, Locks & MySQL, oh my

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

scription
Forum Newbie
Posts: 1
Joined: Thu Feb 12, 2009 3:50 am

Concurrency, Locks & MySQL, oh my

Post by scription »

Hey everybody, my name's David and I'm new here... I've dabbled in PHP in the past - nothing serious.

I'm attempting to write a random-code generator that will return a 5-character code to a client (for example, '8xc55'). I'd like this code to be unique, so I'm planning on storing it in a MySQL database, along with some other related data. When the script is run, it should generate a new code, confer with the database to make sure that code doesn't already exist, and return the code. If the code does exist, the script loops until it's generated a unique code. (Note that the codes will get thrown out after a certain amount of time, so I'm not worried about using up all the possible codes.)

The problem I'm thinking I'll have is with concurrency. Say a bajillion clients access the script at the same time (I can dream, right? :)). It's foreseeable that the same 5-character code could be generated by two different instances of the script running at the same time. For this reason, I'm assuming I will need a lock around the 5-character generation code. In pseudo-code, it would look something like:

Code: Select all

 
lock();
    while (true)
    {
        newCode = generateNewCode();
        
        if (codeDoesntAlreadyExist(newCode))
            break;
    }
    
    writeCodeToDatabase(newCode);
unlock();
 
I've looked into flock() for the locking mechanism, but after reading some comments on it, it sounds like I should use something else. Anyone have any suggestions for locking mechanisms for the above pseudo-code?

Thanks a lot for any insight!

David
User avatar
sergio-pro
Forum Commoner
Posts: 88
Joined: Sat Dec 27, 2008 12:26 pm

Re: Concurrency, Locks & MySQL, oh my

Post by sergio-pro »

Hi

In Mysql Database - oriented pseudocode it will look like:

Code: Select all

 
 
$mysql->execute("START TRANSACTION");
 
while (!$found) {
  $code = generateCode();
  $rows = $mysql->select("SELECT COUNT(*) as cnt FROM mytable WHERE code='$code' FOR UPDATE;");
  IF ($rows[0]['cnt'] == 0)
    break;
}
INSERT INTO mytable (code) VALUES ($code);
$mysql->execute("COMMIT");
 
 
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Re: Concurrency, Locks & MySQL, oh my

Post by pickle »

It's reassuring to see someone relatively new realize that generating an arbitrarily large, random string does not guarantee uniqueness.

Transactions don't work in MyISAM tables, which is likely the format you've got. So, you'll need to do table locking. flock() is for actual files, not databases. You'll want to start with "LOCK whatever-your-table-name-is WRITE", do your code generation, then finish with "UNLOCK TABLES".

Also, look into a do...while() loop - which is guaranteed to execute at least once & is exactly what you want to use rather than while(true){...}
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
User avatar
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Re: Concurrency, Locks & MySQL, oh my

Post by s.dot »

pickle wrote:Also, look into a do...while() loop - which is guaranteed to execute at least once & is exactly what you want to use rather than while(true){...}
Yes. :)

There are a couple other precautions you could take if you are worried about a race condition happening..

Make the code field a unique index in mysql - same code can't be entered on two separate rows
Guarantee the code is unique! You could prefix your codes with time() (unique once per second) or uniqid() (ermm, i think this is unique all of the time?) but the code will be longer than 5 characters.
Set Search Time - A google chrome extension. When you search only results from the past year (or set time period) are displayed. Helps tremendously when using new technologies to avoid outdated results.
cristiano
Forum Newbie
Posts: 18
Joined: Tue Jul 01, 2008 1:44 am

Re: Concurrency, Locks & MySQL, oh my

Post by cristiano »

If you're gonna have a billion clients looking up and creating a small random string, the checking overhead will prolly kill your server.

I think instead of checking the database if it has the unique string already, you should use an algorithm that will produce unique codes each time you run the script.

I would imagine I would prolly run an auto increment and then encode the auto increment with a custom made algorithm.

This is my thought.
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

cristiano has the right idea, though I wouldn't use a custom algorithm. Hashing function are exactly what you need - they generate unique strings for unique values. You can use for values something that is guranteed to be unique, such as the user ID and the result of the microtime() function. Something like:

Code: Select all

$unique = hash('sha256',microtime() . $userId); //$userId is the unique user identifier you should have in your application
This will return a 64 character unique string which you can trim to the size you need. Keep in mind though that less characters mean a higher chance of collision between values (since there are less options).

http://www.php.net/manual/en/function.hash.php
cristiano
Forum Newbie
Posts: 18
Joined: Tue Jul 01, 2008 1:44 am

Re: Concurrency, Locks & MySQL, oh my

Post by cristiano »

He's prolly looking for all this concurrency stuff because he's coding an URL shortener or something of that sort.

Instead of using SHA or any other one side encryption, its prolly easier to use a custom encoding function and then decoding it on your server itself to find the number it is referring to and then looking up that auto increment on the database would make it the fastest.
User avatar
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W
Contact:

Re: Concurrency, Locks & MySQL, oh my

Post by pickle »

He has to check the DB if he wants to guarantee uniqueness. No matter what your algorithm - whether it uses a random number or the current timestamp (even with microseconds), there is still a mathematical possibility that a duplicate ID will exist. The only way to guarantee uniqueness via your script is to have a record of what's already been generated.

He won't be doing tons of lookups I'd imagine - the probability of a duplicate ID is pretty low, so he'd only be checking once, maybe twice for each ID.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

pytrin wrote:Hashing function are exactly what you need - they generate unique strings for unique values.
I can't agree with this one :(

http://en.wikipedia.org/wiki/Hash_function
A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index into an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Injective and perfect hashing

The ideal hashing function should be injective — that is, it should map each valid input to a different hash value. Such a function would directly locate the desired entry in a hash table, without any additional search.

An injective hash function whose range is all integers between 0 and n−1, where n is the number of valid inputs, is said to be perfect. Besides providing single-step lookup, a perfect hash function also results in a compact hash table, without any vacant slots.

Unfortunately, injective and perfect hash functions exist only in very few special situations (such as mapping month names to the integers 0 to 11); and even then they are often too complicated or expensive to be of practical use. Indeed, hash functions are typically required to map a large set of valid potential inputs to a much smaller range of hash values (e.g. for mapping data items from a large set of valid values into elements of a memory array of limited size) and therefore cannot be injective.
As I mentioned in my past post (sorry, josh) hash algorithms (the useful ones) require the existence of collisions, so "they generate unique strings for unique values" is not true.

It's evident that if you have a set M(n) mapped to a set L(n) where the number of all possible values of M(n) set is larger then the number of all possible values of L(n) ser there will be always an existing "a" and "b" for which M(a) != M(b) while L(a) == L(b).
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

Not sure what you are turning this into - it's not a theoretical debate on what hashing functions are (I think you already had this in another thread).

Strictly speaking - he wants to generate a unique string on demand for his clients. If he follows conventions, each client has a unique identifier (probably an auto-incrementing integer key). If you make the reasonable assumption that a client won't be asking for more than one unique string per micro-second, combining that identifier with the micro-timestamp should be enough to guarantee a unique signature. The hashing function merely takes that signature and produces a string that people like to see as unique strings - a combination of characters and numbers.

The better hashing functions (such as SHA2) have no known collisions to date. They offer 2^64 which is around 18.4 * 10^18 possibilities. I would say that would qualify as sufficient, especially when the strings that generate those unique signature have a smaller size. He can mix a database check to be absolutely sure, but I would leave transactions and locking out of it since the chance is so low that is not worth dealing with (what is called - statistically impossible).
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

I simply mentioned that "they generate unique strings for unique values" statement is not true because you need a perfect hashing function. And SHA2 (or any limited value hashing function) is not a prefect one. "Statistically impossible" doesn't mean "impossible".
"(such as SHA2) have no known collisions to date" does not prove there are no collisions.

I do agree with pickle's statement: "He has to check the DB if he wants to guarantee uniqueness. No matter what your algorithm - whether it uses a random number or the current timestamp (even with microseconds), there is still a mathematical possibility that a duplicate ID will exist. The only way to guarantee uniqueness via your script is to have a record of what's already been generated." even when hashing functions are involved.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

My original point was that he shouldn't use custom hashing functions as there are better solutions available. My second point is that locking / transactions is an overkill since the chance of collision is extremely low. I don't see anything in what you wrote that refutes those arguments. What exactly is your point? would you use transactions for solving this problem?
User avatar
sergio-pro
Forum Commoner
Posts: 88
Joined: Sat Dec 27, 2008 12:26 pm

Re: Concurrency, Locks & MySQL, oh my

Post by sergio-pro »

IMO talking about hash algorithms does not make sense in this context.
If you need to generate unique string - you just generate GUID - and do not care of any collisions.

But the task is to generate FIVE SYMBOL code - so the best way I think - is custom algorithm - and database checks.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Concurrency, Locks & MySQL, oh my

Post by VladSun »

pytrin wrote:My original point was that he shouldn't use custom hashing functions as there are better solutions available. My second point is that locking / transactions is an overkill since the chance of collision is extremely low. I don't see anything in what you wrote that refutes those arguments. What exactly is your point? would you use transactions for solving this problem?
My point is, I don't think you are using the term "hashing function" properly.
I don't see where the OP has used any "custom hashing functions". The process of generating an unique string in the time domain is not hashing.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Concurrency, Locks & MySQL, oh my

Post by Eran »

This is why it's important to read the posts as part of a complete conversation. I was responding to this cristiano fellow who suggested using custom functions to generate strings from the auto-incrementing identifier and I simply suggested that there hashing functions might be better used instead of creating a custom function with an unknown (probably worse) collision matrix.

I wasn't trying to start a debate on the merits of hashing functions for such purposes versus their theoretical definition.
Post Reply